ROME AIR DEVELOPMENT CENTER
AIR FORCE SYSTEMS COMMAND
GRIFFISS AIR FORCE BASE, NEW YORK 13441


Polytechnic  Institute  of  New  York 


RADC-TR-77-15
Technical  Report 
January  1977 


SEEDING/TAGGING  ESTIMATION  OF  SOFTWARE  ERRORS: 
MODELS  AND  ESTIMATES 




Approved  for  public  release; 
distribution  unlimited. 


This report has been reviewed by the RADC Information Office (OI) and
is  releasable  to  the  National  Technical  Information  Service  (NTIS).  At  NTIS 
it  will  be  releasable  to  the  general  public  including  foreign  nations. 

This  report  has  been  reviewed  and  is  approved  for  publication. 


APPROVED:

ALAN  N.  SUKERT,  Capt,  USAF 
Project  Engineer 


APPROVED: 


ALAN  R.  BARNUM 

Assistant  Chief,  Information  Sciences  Division 


FOR  THE  COMMANDER 


JOHN  P.  HUSS 

Acting  Chief,  Plans  Office 


Do  not  return  this  copy.  Retain  or  destroy. 


UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
(READ INSTRUCTIONS BEFORE COMPLETING FORM)

2. GOVT ACCESSION NO.

4. TITLE (and Subtitle)
SEEDING/TAGGING ESTIMATION OF SOFTWARE ERRORS: MODELS AND ESTIMATES

5. TYPE OF REPORT & PERIOD COVERED
Interim Report, 1 Apr 74 - 30 Jun 76

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Polytechnic Institute of New York
333 Jay St
Brooklyn NY 11201

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
62702F
55500806

11. CONTROLLING OFFICE NAME AND ADDRESS
Rome Air Development Center (ISIS)
Griffiss AFB NY 13441

15. SECURITY CLASS. (of this report)
UNCLASSIFIED

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
Same

18. SUPPLEMENTARY NOTES
RADC Project Engineer: Capt Alan N. Sukert (ISIS)

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Software Errors
Seeding/Tagging Estimates
Error Seeding
Error Tagging
Software Modeling

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Abstract


Seeding/tagging estimates of the number of software errors are computed
from s, t and c, where: t is the number of errors either inserted deliberately
in a program (seeded) or found by debugging (tagged); s is the number found
by a debugger unaware of the contents of the first set; and c is the number
appearing in both sets.

Two  types  of  questions  can  be  raised.  One  type  relates  to  the  method 
and  procedure:  the  introduction  of  new  errors,  the  changing  of  a program 
by  debugging,  etc.  The  other  relates  to  possible  estimates,  and  their 
evaluation  and  comparison.  This  report  concerns  itself  with  questions  of 
the  second  type.  Estimates  based  on  3 models  are  discussed.  The  models 
are defined by assumptions regarding the equal or unequal difficulty of
uncovering individual errors. Model 1 assumes all errors equally open to
discovery at all times. Models 2 and 3 assume that categories of difficulty
exist and that any error which appears can be assigned to the proper
category. Model 2 does not assume that the relative distribution of errors in a
program among categories is known, but Model 3 does. Estimates for
Models 2 and 3 are shown to be closely related to those for Model 1.

The  mean  and  mean-squared  error  of  a maximum-likelihood  estimate 
and  a modified  maximum  likelihood  estimate  are  given.  It  is  shown  how 
these  quantities  vary  with  certain  relations  among  the  total  number  of 
errors,  size  of  tagged  or  seeded  set  and  size  of  accompanying  sample  set. 
Curves  are  drawn  which  can  be  used  to  determine  optimum  values  for  s 
and  t and  a procedure  is  outlined  for  doing  so. 

More precise estimates can be obtained with several trials rather
than one as described above. Several such estimates are examined and
discussed.


It is concluded in general terms that a reasonable investment of time
will produce adequate estimates.

1.0 Introduction

1.1 Tagging/seeding census methods

It has been suggested that tagging/seeding census methods, used for
many  years  to  estimate  the  size  of  animal  and  fish  populations,  be  borrowed 
from  the  arsenal  of  wildlife  specialists  for  the  purpose  of  estimating  the 
number  of  bugs  in  a computer  program.  The  initial  seeding  suggestion 
came  from  H.  D.  Mills*;  the  tagging  alternative  was  proposed  by 
M.  Hyman*. 

The  "tagging"  and  "seeding"  labels  are  descriptive  of  two  ways  in 
which  the  marked  individuals  required  by  the  process  are  introduced  into 
the  population.  In  the  tagging  variant,  also  called  a capture-recapture 
census,  a sample  of  the  population  is  captured,  tagged  and  returned;  a 
second  sample,  presumably  containing  some  tagged  individuals,  is  then 
captured.  Under  certain  assumptions  one  can  estimate  the  total  population 
from  the  number  of  animals  in  each  of  the  two  captures,  and  the  number 
recaptured, i.e., common to both. Seeding differs only in that the initial
capture  and  release  are  replaced  by  the  procedure  of  adding  other  marked 
individuals  to  the  original  population.  If  a uniform  population  is  assumed, 
the  two  processes  are  statistically  identical:  the  estimates  used  for  the 
total  original  population  in  the  tagging  version  are  used  for  the  augmented 
population  in  the  seeding  version.  It  is  only  necessary,  in  the  latter  case, 
to subtract the number of seeded individuals from the result. If uniformity
is not assumed, differences may arise; these are discussed in Chapter 8 in
the context of application to computer programs.

1.2 Application to software errors

We  can  describe  a process  analogous  to  the  foregoing  animal  census 
method  for  estimating  the  number  of  errors  in  a computer  program  at  any 
point of its debugging life beyond the initial phase of correcting
compiler-discovered errors. Suppose we give the program to two people to debug
(or  to  continue  debugging)  independently,  arranging  that  there  be  no 
contact  between  them.  Each  person  tabulates  and  corrects  errors  as  they 
appear. After an arbitrary period of time (which may differ for the two
debuggers) we look at the results, i.e., the two sets of tabulated errors.
Some  errors  will  occur  on  both  lists  and  some  on  only  one.  Consider  one 
set  to  correspond  to  the  animals  first  captured,  tagged  and  released,  and 
the  other  set  to  correspond  to  the  second  capture.  The  errors  common  to 
both  lists  correspond  to  the  tagged  animals  included  in  the  second  capture. 
What  we  have  described  is  a tagging  analogue.  For  the  seeding  variant  we 
would  eliminate  one  debugger  and  instead  insert  an  arbitrary  number  of 
known  errors  into  the  program.  How  many  errors  are  seeded  and  which 
they  are  is  not  known  to  the  remaining  debugger.  Most  of  this  report  is 
written  with  the  tagging  case  in  mind;  however,  translation  to  the  seeding 
case is direct, in the manner described in Section 1.1.

It is hardly necessary to say that the tagging/seeding application to
software errors raises more questions than can be answered readily.
(Some, in fact, apply with equal validity to the original wildlife census
process and have provoked many long discourses in the statistical journals.)
For example, new errors may be introduced in the course of correcting
those found. Also, two debuggers may correct errors differently, leading
eventually to two quite different programs and different error counts.

We  will,  however,  neither  answer  nor  in  fact  examine  all  the  questions 
one  might  ask.  Instead  we  will  start  with  the  rash  assumption  that  basically 
the  process  works.  Our  objective  is  to  describe  various  models  and  to 
evaluate  a collection  of  estimates  which  they  support.  If  the  approach 
described  is  feasible  at  all,  one  wants  to  know  how  good  the  results  are 
likely  to  be,  which  estimates  produce  the  most  accurate  and  precise  results, 
and  under  what  conditions. 

A subsequent  report  will  describe  an  experiment  which  should  reveal 
how  well  the  technique  works,  what  the  problems  are,  perhaps  what  the 
answers  to  some  of  the  questions  are,  and  of  course  how  our  estimates 
compare  if,  indeed,  it  is  possible  to  make  them. 

It will be assumed here that no new errors are introduced and that
different debuggers do not change the program in different ways in
correcting the same error.

1.3 Notation

The following symbols will be used uniformly:

$N$ = total number of errors initially present in a computer
program (i.e., present when the test is begun). $N$ includes
seeded errors if the seeding variant is used.

$\hat{N}$ = any estimate of $N$

$t$ = number of tagged errors; these are all the errors discovered
by the first debugger (the tagger) in the tagging case,
or the number of seeded errors in the seeding case.

$s$ = number of sampled errors; these are all the errors discovered
by the second debugger (the sampler) in the tagging
case, or by the only debugger in the seeding case.

$c$ = number of errors common to the tagged and sampled sets;
i.e., the number found by both debuggers in the tagging
case, or the number of seeded errors found by the sole
debugger in the seeding case.

$\bar{c} = \frac{1}{n}\sum_{i=1}^{n} c_i$ = average of $n$ values of $c$

$V(\hat{N}) = E[(\hat{N}-N)^2]$ = mean-squared variation of a biased estimate $\hat{N}$
about the true value $N$, as distinguished from
$\mathrm{var}(\hat{N}) = E\{[\hat{N} - E(\hat{N})]^2\}$

$\sigma_e(\hat{N}) = [V(\hat{N})]^{1/2}$

$b(\hat{N})$ = bias of estimate $\hat{N}$ = $E(\hat{N}) - N$

$[x]$ = greatest integer $\le x$

$P_0 \equiv P(0)$ = probability that $c = 0$

Estimates

$\hat{N}$ = ad hoc estimate = $st/c$

$\hat{N}_0$ = maximum likelihood estimate (Section 4.2)

$\hat{N}_1 = \frac{(s+1)(t+1)}{c+1} - 1$

$\hat{N}_{0i}$ = the $i$th of several …

$\hat{N}_{1i}$ = the $i$th of several …

2.0 Model 1 - Equal Probability Assumption

2.1 Description

The simplest model on which to base an estimate is that which assumes
debugging to be completely random: that is, errors are said to be
indistinguishable, each being found with probability 1/N where N = number of
errors in the program at the time.

2.2 Discussion

The basic assumption that all errors have equal probability of discovery
may not reflect the facts of life in computerland. Although there is
little hard data, the general impression is that some errors are easy to
find and would quickly be turned up by any debugger while some consistently
resist discovery.

One can readily conceive of several factors which may make for variable
difficulty. For example: type of instruction; particular test data;
debugger technique; location (beginning or end of the program, within a loop,
hidden by other errors, etc.).

Individually some such factors would cause underestimation by an estimate
based on the equal probability assumption and some would cause
overestimation. As an example of the latter, variation due to debugger
technique would tend to make the overlap, c (which appears in the
denominator of estimates), too small because the tagger and sampler are, so to
speak, fishing in different waters. On the other hand, a variation more
closely related to the nature of the error itself would cause an
underestimation since the sampler would in a practical sense have available only the
easier bugs and the estimate would actually be of that subset.

It is not known whether some particular factors of this nature have an
overriding effect, or whether all would be well enough served by throwing
them into the statistical mash of equal probability.


3.0 Ad Hoc Estimate $\hat{N}$

A reasonable estimate is suggested directly by the equal probability
assumption. We have a program with N bugs initially; let, for example, the
unknown number N be 100. If the first debugger, the tagger, finds t = 20 bugs,
t/N or 1/5 of the total errors are tagged. The second debugger, the sampler,
finds s = 25 bugs. If the tagged and untagged bugs can be found with equal
ease (or difficulty), both should appear in the s-element sample in about the
same proportion as in the entire set of errors: $t/N \approx c/s$, where c is the
number of tagged bugs turning up in the sample.

Since t, s and c are known, N is approximately determined by the
ratio. This is our first estimate: $\hat{N} = st/c$ (or the nearest integer to st/c).
Suppose the 25 sampled bugs included 6 tagged bugs. Then
$\hat{N} = (25 \times 20)/6 \approx 83$. c would have to be 5 to make the ratio
exactly true and the estimate exactly right.

In the seeding version, the original errors, $N_x$, would have numbered
80; t = 20 would have been seeded; $N = N_x + t = 100$ would have been
estimated by $\hat{N} = 83$ as above, and the estimate of the original number of
errors would have been $\hat{N}_x = \hat{N} - t = 63$.

The example also illustrates a concomitant of all estimates considered,
which we will call integer error: c = 4, 5, 6 give respectively $\hat{N}$ = 125, 100,
83; no in-between values are possible. Clearly the integer constraint on c
can cause a large error in the estimate to arise from a small (even the
smallest possible) deviation in c from its "ideal" value. Integer error
will be discussed again in Section 6.
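
The arithmetic of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `ad_hoc_estimate` and `seeded_original_estimate` are hypothetical and do not appear in the report.

```python
def ad_hoc_estimate(s: int, t: int, c: int) -> int:
    """Ad hoc estimate st/c, rounded to the nearest integer.

    s = number of sampled errors, t = number of tagged (or seeded) errors,
    c = number of errors common to both sets.
    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    if c <= 0:
        raise ValueError("st/c is undefined when c = 0")
    return round(s * t / c)


def seeded_original_estimate(s: int, t: int, c: int) -> int:
    """Seeding variant: subtract the t seeded errors from the estimate."""
    return ad_hoc_estimate(s, t, c) - t


if __name__ == "__main__":
    # Integer error: neighbouring values of c give widely spaced estimates.
    for c in (4, 5, 6):
        print(c, ad_hoc_estimate(25, 20, c))
    print(seeded_original_estimate(25, 20, 6))
```

With s = 25 and t = 20, the loop shows the jump from 125 to 100 to 83 as c moves from 4 to 6, which is the integer error described above.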


4.0 Maximum likelihood estimate $\hat{N}_0$

4.1 Distribution of data values

For a more formal derivation of estimates, we recognize that the
tagging/seeding procedures outlined describe a standard experiment in
sampling without replacement [1]. The collection of N errors is analogous to
an urn of balls, identical except for color: t balls (the tagged errors)
are red while the N-t remaining are white. The debugging experiment is
equivalent to having a blindfolded sampler reach in and withdraw s balls.
Some balls in the sampled group will be red and some white. The number
of red balls sampled is a discrete random variable $\tilde{c}$* which can assume
only integral, non-negative values. The probability that $\tilde{c}$ will have some
particular value c is given by the hypergeometric distribution:

$$P(c \mid s,t,N) = \frac{\binom{t}{c}\binom{N-t}{s-c}}{\binom{N}{s}} = \frac{\binom{s}{c}\binom{N-s}{t-c}}{\binom{N}{t}} = \frac{s!\,t!\,(N-s)!\,(N-t)!}{N!\,c!\,(s-c)!\,(t-c)!\,(N-s-t+c)!}$$

The mean and variance of the distribution are respectively [2]:

$$E(c \mid s,t,N) = \frac{st}{N}$$

$$\mathrm{var}(c \mid s,t,N) = \frac{st}{N}\cdot\frac{(N-s)(N-t)}{N(N-1)}$$


The distribution is symmetrical with respect to s and t, implying,
reasonably enough, that it does not matter which debugger we call the sampler
and which the tagger, nor whether s or t is larger.

A lower bound on c is certainly 0. However, if s + t > N, small
values of c (including 0) are impossible. For example, if 55 of a total of
100 bugs are tagged, and the sampler finds 50, there must be an overlap of at
least 5 in the tagged and sampled sets; therefore P(c | 50, 55, 100) = 0 for
c = 0, 1, ..., 4. The distribution as given does not exist for those values
since the factorial of a negative number is undefined; however, if we replace
the factorials by the corresponding gamma functions, we do get P = 0 for
0 ≤ c < 5.

* The tilde will generally be omitted in this report; it should be clear from
the context whether the random variable or a particular value is intended.

Since the number of common bugs can exceed neither the number tagged
nor the number sampled, the limits of c are described by
$\max(0,\ s+t-N) \le c \le \min(s,t)$.

Figure 1 shows the distribution for different values of s, t and N,
continuous lines replacing point probabilities for readability.
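
The distribution and its limits can be tabulated directly. The following is a minimal sketch using exact rational arithmetic; the helper names `support`, `pmf` and `mean_and_variance` are assumptions made for illustration, not part of the report.

```python
from fractions import Fraction
from math import comb


def support(s, t, N):
    """Attainable overlaps: max(0, s+t-N) <= c <= min(s, t)."""
    return range(max(0, s + t - N), min(s, t) + 1)


def pmf(c, s, t, N):
    """Hypergeometric probability P(c | s, t, N); zero outside the support."""
    if c not in support(s, t, N):
        return Fraction(0)
    return Fraction(comb(t, c) * comb(N - t, s - c), comb(N, s))


def mean_and_variance(s, t, N):
    """Mean and variance of c, computed by direct summation."""
    mean = sum(c * pmf(c, s, t, N) for c in support(s, t, N))
    var = sum((c - mean) ** 2 * pmf(c, s, t, N) for c in support(s, t, N))
    return mean, var


if __name__ == "__main__":
    mean, var = mean_and_variance(25, 20, 100)
    print(mean, var)                       # compare with st/N and the variance formula
    print(list(support(50, 55, 100))[:3])  # smallest attainable overlaps
```

Summing over the support rather than over 0..min(s, t) avoids the negative factorial arguments mentioned above.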

4.2 Maximum Likelihood Estimate

In the case at hand s and t are known parameters, c is experimentally
determined, and the problem is to estimate the unknown parameter N. The
maximum likelihood estimate $\hat{N}_0$ is shown in [1] to be

$$\hat{N}_0 = \left[\frac{st}{c}\right],$$

which is essentially equal to the ad hoc estimate $\hat{N}$. Since $st/c$ does not exist
for c = 0, we arbitrarily define $\hat{N}_0$ to be 2st when c = 0. It is a reasonable
choice since it amounts to replacing c = 0 by c = 1/2.

The properties of $\hat{N}_0$ are derived and examined in Chapman [3]. It is
shown there that $\hat{N}_0$ is a consistent* but positively biased estimate, the
bias and variance decreasing with increasing $st/N$, i.e., with increasing
mean of the distribution. Because of the bias, the mean-squared error
$V(\hat{N}) = E[(\hat{N}-N)^2]$ rather than the variance was taken as a measure of
dispersion. Both bias and mean-squared error have rather unwieldy expressions
and Chapman gives much simpler approximate forms, derived for $st/N > 10$.

* Consistency here means that the estimate approaches N in probability in
either of two circumstances: either (1) N increases while s/N and t/N
remain constant or (2) N remains constant and the product st increases.

The work covered in this report had been substantially completed when
inconsistencies in certain results led to a check of the simplified formulas
for bias and dispersion. It was discovered that certain approximations used
in deriving the formulas introduced serious error unless N, s and t were
actually very large numbers; the condition $st/N$ large was not sufficient.
Table 1 lists 3 cases, for all of which st/N = 13.33, giving percent error in
the original approximation formula for $E(\hat{N}_0)$. Clearly the error decreases
as the magnitudes increase. The mean-squared error formula shows an
even larger deviation.


(st/N = 13.33)

N, s, t            % error in approximate formula for $E(\hat{N}_0)$
30, 20, 20         7.9
270, 60, 60        3.5
3000, 200, 200     1.1

Table 1. Percent error in first approximation formula for $E(\hat{N}_0)$
for several examples with st/N = 13.33

The figure for bias resulting from the approximate formula was, for
small N, not much larger than the error, indicating that both bias and
dispersion might actually be considerably lower than appeared. Consequently,
it was necessary to derive second-order approximations for bias and
mean-squared error which would be more accurate than Chapman's first-order
approximations but still simpler than the exact expressions. Such
approximations would permit quick calculation, provide insight into the manner
in which bias and dispersion change with changing parameter values, and
facilitate comparison with other estimates.

* There were actually 2 sources of inconsistency. Since Chapman's
approach did not apply in the multi-trial case (Sec. 6) another approach was
used and a new formula derived which, with the number of trials reduced
to one, should have given about the same result as was obtained with
Chapman's formula. However, the figures for dispersion in the example
tested were far different. The second inconsistency was noted when
comparison was made with some specific cases in a tabulation containing
means and variances computed directly from probabilities [4].

Two sets of bias and mean-squared error formulas were obtained, one
using the method applied by Chapman but eliminating the offending
approximations, and the other based on a Taylor's series expansion of $1/c$. The
first derivation is described in Appendix 1 and the second in Appendix 2.

The Taylor's series approach was initially applied to find the mean and
dispersion of estimates based on several data values (see Sec. 6), a
problem to which the Chapman method is not applicable. Although such was not
their raison d'etre, the resulting formulas can be used to verify the
calculations for $\hat{N}_0$ as well.

On the basis of the new approximations, additional interesting
information was obtained on the manner in which bias and mean-squared error
change with the parameters, information which would be useful in designing
an actual estimation effort.

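4.3 Bias

The new approximation for the expected value of $\hat{N}_0$ derived from
Chapman's exact result is

$$E(\hat{N}_0) \approx st\,[\alpha_1 + \alpha_2 + 2\alpha_3 + 6\alpha_4 + \dots + (m-1)!\,\alpha_m] \tag{1}$$

where

$$\alpha_1 = \frac{N+1}{(s+1)(t+1)}, \qquad \alpha_i = \alpha_{i-1}\,\frac{N+i}{(s+i)(t+i)}, \quad i = 2, 3, \dots$$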

The requirements for accuracy (see Appendix 1) are the following:

1. Enough terms must be included in the sum, which is a truncated
version of an infinite sum, to leave the remainder insignificant. Four or
five terms have been found sufficient.

2. The probability that c = 0 must be very small. By referring to
the examples of hypergeometric distribution in Fig. 1, one sees that this
occurs when the peak is far from 0, i.e., when the mean of the distribution,
st/N, is large. In fact, common sense tells us that large samples are
almost certain to have elements in common; i.e., $P(0) \approx 0$. st/N > 3 seems
to be sufficient for accuracy unless N is very large (in which case the
variance of the distribution and therefore $P_0$ is large).

An alternative form of Eq. (1), derived by simple manipulations (see
Appendix 1), is:

$$E(\hat{N}_0) \approx N\left[k_1 + k_2\left(\frac{N}{st}\right) + 2k_3\left(\frac{N}{st}\right)^2 + \dots\right] \tag{1a}$$

where

$$k_1 = \frac{1 + 1/N}{(1 + 1/s)(1 + 1/t)}, \qquad k_i = k_{i-1}\,\frac{1 + i/N}{(1 + i/s)(1 + i/t)}, \quad i = 2, 3, \dots$$

The quantities $k_i$ are close to 1 and increase to 1 as a limit as s, t and N
increase. If we set all $k_i = 1$, we arrive at Chapman's approximate formula

$$E(\hat{N}_0) \approx N\left[1 + \left(\frac{N}{st}\right) + 2\left(\frac{N}{st}\right)^2 + \dots\right].$$
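
The two series for the mean can be compared numerically. The sketch below truncates both at m = 5 terms (the report states that four or five terms suffice) and returns the ratio $E(\hat{N}_0)/N$. The function names are hypothetical, and the general term $(i-1)!\,k_i\,(N/st)^{i-1}$ is inferred from the leading coefficients 1, 1, 2, 6 shown in Eq. (1).

```python
from math import factorial


def k_factors(N, s, t, m):
    """k_1 .. k_m from the recursion k_i = k_{i-1}(1+i/N)/((1+i/s)(1+i/t))."""
    ks, k = [], 1.0
    for i in range(1, m + 1):
        k *= (1 + i / N) / ((1 + i / s) * (1 + i / t))
        ks.append(k)
    return ks


def mean_ratio_eq1a(N, s, t, m=5):
    """Truncated Eq. (1a) divided by N."""
    x = N / (s * t)
    ks = k_factors(N, s, t, m)
    return sum(factorial(i - 1) * ks[i - 1] * x ** (i - 1) for i in range(1, m + 1))


def mean_ratio_chapman(N, s, t, m=5):
    """Chapman's first-order form: Eq. (1a) with every k_i set to 1."""
    x = N / (s * t)
    return sum(factorial(i - 1) * x ** (i - 1) for i in range(1, m + 1))


if __name__ == "__main__":
    for N, s, t in [(30, 20, 20), (270, 60, 60), (3000, 200, 200)]:
        print(N, s, t, mean_ratio_eq1a(N, s, t), mean_ratio_chapman(N, s, t))
```

Because every $k_i$ is below 1 when s and t do not exceed N, the truncated Eq. (1a) always falls below Chapman's form with the same number of terms.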


A method described in [5] for deriving the expected value of a function
of a random variable by means of a Taylor's series expansion was applied (see
Appendix 2) leading to

$$E(\hat{N}_0) \approx N\left[1 + q\left(\frac{N}{st}\right) + 3q^2\left(\frac{N}{st}\right)^2\right] \tag{2}$$

where

$$q = \frac{(N-s)(N-t)}{N^2}.$$

This is subject to the same caveat as Eq. (1): truncation effect and
$P_0 \ne 0$ are possible sources of error. Both tend to show up for small $st/N$,
and for large values of N, s, t, i.e., values for which $\min(s,t) \gg st/N$.

The bias, b, of an estimate $\hat{N}$ is defined by $E(\hat{N}) = N + b$. The quantity
of greatest interest is the ratio $b/N$ (or percent bias $= (b/N) \times 100\%$) since to
estimate N = 100 as $\hat{N}$ = 120 is clearly a grosser error than to estimate
N = 1000 as 1020.

The percent bias of $\hat{N}_0$ varies in 3 different ways: (1) with size of
tagged and sampled sets relative to total number of errors, quantified by
the ratio $st/N$; (2) with the total number of errors N; and (3) with size of
sampled set relative to size of tagged set, $s/t$. The nature of each variation,
with the other 2 sources held constant, is considered next.

1. $b/N$ decreases as $st/N$ increases, for N and $s/t$ constant.

The consistency of the estimate, shown by Chapman, implies that this
is so in the limit. For finite N, s and t, Eq. (1a) shows that variation in
$E(\hat{N}_0)$, with N constant, depends principally on the $N/st$ factors. While the
k factors increase with increasing s and t, all are less than and close to
1 and vary very little over large changes in s and t. (See Fig. 2(a).)

2. $b/N$ increases with N for $st/N$ and $s/t$ fixed.

In this case, the only variation in $E(\hat{N}_0)$ is with the quantities $k_i$ (see
Eq. (1a)) which increase with increasing N under the given conditions.
The common upper limit of the $k_i$'s is 1, which occurs only for infinite s, t
and N. Chapman's formula, which results if all $k_i = 1$, therefore gives an
upper limit to the bias ratio, holding for very large N. (See Fig. 2(b).)


3. For $st/N$ and N fixed, $b/N$ is greatest when $s/t = 1$.

For $st/N$ and N fixed, the product st is fixed, and $s/t = 1$ implies
$s = t = \sqrt{st}$. We can show (see Appendix 4) that

$$\frac{1 + i/N}{(1 + i/s)(1 + i/t)} \le \frac{1 + i/N}{(1 + i/\sqrt{st})^2}$$

from which it follows that the $k_i$, and therefore $b/N$, are maximum for s = t.
(See Fig. 2(c).)
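
The inequality cited from Appendix 4 can be checked in one step. The following is a sketch of one way to see it, using the arithmetic-geometric mean inequality $s + t \ge 2\sqrt{st}$; it is not necessarily the argument given in the appendix.

```latex
\[
\Bigl(1+\tfrac{i}{s}\Bigr)\Bigl(1+\tfrac{i}{t}\Bigr)
  = 1 + i\,\frac{s+t}{st} + \frac{i^{2}}{st}
  \;\ge\; 1 + \frac{2i}{\sqrt{st}} + \frac{i^{2}}{st}
  = \Bigl(1+\tfrac{i}{\sqrt{st}}\Bigr)^{2},
\]
% with equality only when s = t, so for a fixed product st the
% denominator of each k_i is smallest, and k_i largest, at s = t.
```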

The first property states the unsurprising fact that, given a particular
program, large samples produce accurate estimates. The third property
says that if, in addition, we make s and t unequal, we increase the
accuracy of $\hat{N}_0$ still more. However in both cases, the increased accuracy is
paid for in time: larger sets of errors take longer to find, and s + t
increases as $s/t$ departs from 1.

The second property says that under the same conditions of $st/N$ and $s/t$
we get better results for programs with fewer errors, e.g. by estimating N
after some debugging has been done. However, as N increases, keeping $st/N$
constant requires relatively smaller samples. For N = 1000, for example,
s = t = 100 gives $st/N = 10$, while for N = 250, we get the same value of $st/N$
with s = t = 50. That is, in the second case, 100 bugs must be found while in
the first, with N four times as large, only 200, or twice as many bugs, must
be found. If we spend the same time relative to N and find 400 bugs in the
first case (s = t = 200), we increase $st/N$ by a factor of 4 and decrease bias
considerably. To sum up the argument, if we keep the debugging time, as
measured by s + t, proportional to N, then $\hat{N}_0$ has smaller bias for large N.
(See Fig. 2(d).)

4.4 Mean-squared Error

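New approximate formulas for $V(\hat{N}_0)$, the variation about the true value
N, were derived using Chapman's method (Appendix 1) and the Taylor's
series method (Appendix 2). They are respectively,

$$V(\hat{N}_0) \approx N^2\left\{1 + \frac{st}{N}\left[-2\alpha_1 + \left(\frac{st}{N} - 2\right)\alpha_2 + \left(\frac{3st}{N} - 4\right)\alpha_3 + \dots + \left(A_{m-1}\frac{st}{N} - 2(m-1)!\right)\alpha_m\right]\right\} \tag{3}$$

where the $\alpha$'s are defined as in Eq. (1) and

$$A_{m-1} = (m-1)!\sum_{j=1}^{m-1}\frac{1}{j},$$

and

$$V(\hat{N}_0) \approx N^2\left[q\left(\frac{N}{st}\right) + 9q^2\left(\frac{N}{st}\right)^2\right] \tag{4}$$

where q is defined as in Eq. (2).

An alternative form for Eq. (3) is

$$V(\hat{N}_0) \approx N^2\left[(1 + k_2 - 2k_1) + (3k_3 - 2k_2)\frac{N}{st} + (11k_4 - 4k_3)\left(\frac{N}{st}\right)^2 + \dots + \left(A_{m-1}k_m - 2(m-2)!\,k_{m-1}\right)\left(\frac{N}{st}\right)^{m-2}\right] \tag{3a}$$

where the k's are defined as in Eq. (1a).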
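
As a quick check on the coefficients, the values $A_1, A_2, A_3 = 1, 3, 11$ multiply $k_2, k_3, k_4$ in Eq. (3a). The sketch below computes $A_n$ exactly and evaluates the Taylor-series form, Eq. (4). The function names `a_coeff`, `mse_eq4` and `relative_error_eq4` are assumptions made for illustration.

```python
from fractions import Fraction
from math import factorial, sqrt


def a_coeff(n):
    """A_n = n! * (1 + 1/2 + ... + 1/n), an exact integer."""
    return int(factorial(n) * sum(Fraction(1, j) for j in range(1, n + 1)))


def mse_eq4(N, s, t):
    """Eq. (4): V(N0) ~ N^2 [ q (N/st) + 9 q^2 (N/st)^2 ]."""
    q = (N - s) * (N - t) / N ** 2
    x = N / (s * t)
    return N ** 2 * (q * x + 9 * q ** 2 * x ** 2)


def relative_error_eq4(N, s, t):
    """sigma_e / N implied by Eq. (4)."""
    return sqrt(mse_eq4(N, s, t)) / N


if __name__ == "__main__":
    print([a_coeff(n) for n in range(1, 5)])
    print(relative_error_eq4(3000, 200, 200))
```

Note that Eq. (4) vanishes when s = N or t = N, as it should: if either debugger finds every error, q = 0.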

The formulas hold under the same conditions as the mean formulas:
$P_0 \approx 0$, and low truncation error. Furthermore the same generalizations
can be made with respect to the variation of $V(\hat{N}_0)$ with N, s, t.

5.0 Modified Maximum Likelihood Estimate $\hat{N}_1$

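5.1 Bias and Mean-squared Error

An intermediate result in Chapman's derivation of $E(\hat{N}_0)$,

$$E\left(\frac{1}{c+1}\right) = \frac{N+1}{(s+1)(t+1)}\,(1-K), \quad \text{where } K = \begin{cases} \dfrac{N-s-t}{N+1}\,P_0 & \text{for } s+t < N \\[4pt] 0 & \text{otherwise} \end{cases}$$

suggests the modified estimate

$$\hat{N}_1 = \frac{(s+1)(t+1)}{c+1} - 1$$

as a means of reducing the bias to practically zero assuming $P_0 \approx 0$. For

$$E(\hat{N}_1) = (s+1)(t+1)\,E\left(\frac{1}{c+1}\right) - 1 = (N+1)(1-K) - 1$$

$$E(\hat{N}_1) = N - K(N+1), \quad \text{where } K \approx 0 \text{ if } P_0 \approx 0$$

$$\therefore\; E(\hat{N}_1) \approx N$$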

The bias is negative but very small even for small $st/N$. Consider for
example, the case N = 6, s = 2, t = 3, with $st/N = 1$. $E(\hat{N}_1)$, computed exactly,
is 5.8, and b/N is -3.3%, whereas $E(\hat{N}_0)$ is 6.6 with b/N = 10%.

However, for $\hat{N}_1$ as for $\hat{N}_0$, the magnitude of b/N increases if st/N is held
fixed but N increases. If N = 20, s = 4, t = 5, $st/N$ is still 1 but $E(\hat{N}_1)$ is
now 16.9 and b/N = -15.5%.
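
These small cases can be verified by summing over the hypergeometric distribution directly. The sketch below does so in exact arithmetic, taking $\hat{N}_0 = [st/c]$ with the convention $\hat{N}_0 = 2st$ at c = 0 from Section 4.2. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def pmf(c, s, t, N):
    """Hypergeometric probability P(c | s, t, N) for c inside its limits."""
    return Fraction(comb(t, c) * comb(N - t, s - c), comb(N, s))


def mle(s, t, c):
    """N0 = [st/c], with N0 defined as 2st when c = 0."""
    return 2 * s * t if c == 0 else (s * t) // c


def modified(s, t, c):
    """N1 = (s+1)(t+1)/(c+1) - 1."""
    return Fraction((s + 1) * (t + 1), c + 1) - 1


def exact_mean(estimate, s, t, N):
    """E(estimate), summed over every attainable overlap c."""
    lo, hi = max(0, s + t - N), min(s, t)
    return sum(pmf(c, s, t, N) * estimate(s, t, c) for c in range(lo, hi + 1))


def percent_bias(estimate, s, t, N):
    """100 * b/N, where b = E(estimate) - N."""
    return float(100 * (exact_mean(estimate, s, t, N) - N) / N)


if __name__ == "__main__":
    for N, s, t in [(6, 2, 3), (20, 4, 5)]:
        print(N, s, t,
              float(exact_mean(mle, s, t, N)),
              float(exact_mean(modified, s, t, N)),
              round(percent_bias(modified, s, t, N), 1))
```

For N = 6, s = 2, t = 3 the overlap c takes the values 0, 1, 2 with probabilities 0.2, 0.6, 0.2, which gives the means 6.6 and 5.8 quoted in the text.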

An additional advantage of $\hat{N}_1$ is the fact that its variation about N is
somewhat lower than $V(\hat{N}_0)$ for N greater than about 50. Below 50, $V(\hat{N}_0)$
is smaller. The second-order approximation for $V(\hat{N}_1)$, under the same
approximation rules as $E(\hat{N}_0)$ and $V(\hat{N}_0)$, is (see Appendix 1)

$$V(\hat{N}_1) \approx (s+1)^2(t+1)^2\,[\alpha_2 + \alpha_3 + 2\alpha_4 + \dots + (m-2)!\,\alpha_m] - (N+1)^2 \tag{5}$$

or

$$V(\hat{N}_1) \approx (s+1)^2(t+1)^2\left(\frac{N}{st}\right)^2\left[(k_2 - k_1^2) + k_3\left(\frac{N}{st}\right) + 2k_4\left(\frac{N}{st}\right)^2 + \dots\right] \tag{5a}$$

Some comparative figures for $\hat{N}_0$ and $\hat{N}_1$ are shown in Table 2 and in
Table 3 of Section 6.3. The variation of $\sigma_e(\hat{N}_1)/N$ with relations among the
parameters, as described in detail in Sec. 4.3, is plotted in Fig. 3.


5.  2 Useful  Range 

It is obviously possible to make accurate and precise estimates with
large enough samples; the limiting case of s = t = N produces a perfect
estimate. Whether a good estimate can be made with considerably smaller
samples is the issue. $N_1$ has almost no bias, so the major problem resides
in the variance (which, for zero bias, equals the mean-squared error). As
Eq. (5) and Fig. 3(a) show, the variance is low when the ratio st/N is large
enough. But large ratios can be attained with relatively small samples only
for N large. For N = 3000, for example, st/N = 13.33 can be realized with
s = t = 200, or one-fifteenth of N; but for N = 30, st/N = 13.33 requires
s = t = 20, two-thirds of N. Fortunately, Fig. 3(b) shows that smaller
values of st/N are required to give a specified value of $\sigma_e/N$ at the
30-error level than at 3000. The 1.0 curve in Fig. 3(d) shows the
minimum value of $\sigma_e/N$ which can be attained if we limit s and t to
half of N. If we are willing to accept larger samples, we can, of course, do
better for the smaller values. Larger samples mean more time. For the same
time relative to N, estimates of larger programs will have lower $\sigma_e/N$
(Fig. 3(d)). Curves such as those of Fig. 3 can be exploited to design an
estimation test with knowledge of the trade-off between time and precision.

The procedure is very simple. Our objective is to pick values for s
and t which will be likely to produce an estimate of the quality we want.
We begin with a ballpark estimate of the number of errors in the program,
based on whatever information we have: length of program, amount of
previous debugging, experience with other programs of the same type,
expertise of programmers. Suppose we decide that there are probably about
150 errors. In that event $N_1$ is the preferred estimate since it is
practically unbiased and has a lower V than $N_0$ in that range. Had the
estimated N been below 50 we would have had to check the bias and dispersion
of $N_0$ and then choose between $N_0$ and $N_1$.

We will be content with $\sigma_e$ = 30. Then $\sigma_e/N$ = 0.2 and from
Fig. 3(b) we find that the intersection of 150 and 0.2 is on the curve for
st/N = 13.33. (If Fig. 3(a) contained a curve for N = 150, we could have
found the same information there.) Then st = 13.33 x 150 = 2000. We can let
each debugger find about 45 errors, or let one find 50 and the other 40.
Fig. 3(c) shows qualitatively that the results will be about the same. We
can also let s and t be, say, 20 and 100 and expect a somewhat smaller
$\sigma_e$, but we will have to wait considerably longer for the results.
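
The arithmetic of this design step can be sketched as follows. This is only an illustration of the worked example, not a procedure given in the report: the ratio 13.33 still has to be read off Fig. 3(b), and the helper names `required_product` and `partner_size` are hypothetical.

```python
from math import ceil, sqrt

def required_product(n_guess, ratio):
    """Product st needed to reach a chosen st/N ratio for a guessed N."""
    return ratio * n_guess

def partner_size(product, s):
    """Smallest t such that s * t reaches the required product."""
    return ceil(product / s)

product = required_product(150, 13.33)   # about 2000, as in the text
equal = sqrt(product)                    # s = t, about 45 errors each
print(f"st = {product:.0f}, equal samples of about {equal:.1f}")

for s in (45, 50, 20):
    t = partner_size(product, s)
    print(f"s={s:3d}  t={t:3d}  errors to be found: {s + t}")
```

The last column shows why the unbalanced choice takes longer: s = 20, t = 100 requires 120 errors to be found, against 90 for the balanced choices.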

The cost beyond that for the debugging which would have to be done anyway
would be identical for all choices, since the additional cost is only for
the common bugs and the expected number of those is st/N = 13.33.

The situation would be a little different if the program were not to be
completely debugged. The test could, for example, be a means of comparing
different programming techniques. In that case, it would not only take
longer but would also be more expensive to find 120 bugs than to find 90.

We have up to now been discussing an estimation process involving two
debuggers. Suppose we use 3 or more and consider the output of each pair
to be a separate result; m debuggers will give n = m(m-1)/2 possible data
values which can be combined to provide a new estimate with the following
possible advantages:

1. reduced integer error

2. reduced variance, or

3. smaller samples and less debugging time for the same variance.

6.1.1 Integer Error

It was noted in Section 3 that in any seeding/tagging calculation an
error arises from the fact that c is an integer, assuming one of only
min(s,t) + 1 values. Any estimate, it follows, must also have one of only so
many values despite the fact that N may actually be any integer from
max(s,t) to infinity. This is particularly bothersome when the numbers are
relatively small. With a population of, say, $10^4$ and s = t = 1000, data
values c = 100 and c = 99 lead to maximum likelihood estimates of 10,000
and 10,101 respectively, a difference of only 1% of the true population. But
with a population of 100 and s = t = 25, data values c = 6 and 7 respectively
provide estimates of 104 and 89, a span of 15%, with no possibility that any
value obtained with the given estimate will fall within the range. The
difference at the center of the distribution is of the order of
(N/st) x 100% of N. Values of c further away from the mean will lead to even
larger separations: c = 4 and 5, for example, give $N_0$ = 156 and 125
respectively, a difference of 31% of N. These are not improbable values; 4
is about 1 standard deviation away from the mean of the distribution.

Integer error is automatically reduced when several data values are
used, whether the averaging is done on the data values themselves, or on
the several estimates derived from the individual data values. In the last
example, for instance, using 4.5, the average of c = 4 and c = 5, in the
maximum likelihood formula gives an estimate of 138, while averaging the
values of $N_0$ obtained with c = 4 and 5 gives $N_0$ = 140. Either way,
values are possible which cannot be obtained from a single trial, and
increasing the number of trials increases the number of new values. More
in-between values are likely to occur if the final estimates rather than the
data are combined (the average of c = 4 and 6 is not a new value, for
example, while the nonlinearity of the estimates makes repeats when
estimates are averaged improbable), but such estimates may be less desirable
for other reasons.
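
The figures in this subsection follow from the maximum likelihood formula st/c alone, and a short sketch makes the gaps between attainable values visible. It is an illustration only; the report quotes the values rounded down to whole numbers.

```python
s = t = 25

def n0(c):
    """Maximum likelihood estimate st/c for a (possibly averaged) data value c."""
    return s * t / c

# Single-trial estimates: only these discrete values can occur.
for c in (4, 5, 6, 7):
    print(c, round(n0(c), 1))

from_avg_data = n0((4 + 5) / 2)            # average the data, then estimate
from_avg_estimates = (n0(4) + n0(5)) / 2   # estimate, then average
print(round(from_avg_data, 1), round(from_avg_estimates, 1))
```

Both combined values fall strictly between the single-trial estimates for c = 4 and c = 5, which is the reduction in integer error described above.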

6.1.2 Variance and Mean-squared Error

We can reasonably expect some reduction in variance in a multi-trial
process regardless of the estimate formula. However, the bias of the new
estimate as well as the degree of improvement in variance do depend on the
formula.

One combining mode is to compute an estimate for each data value,
using any single-trial formula, and average the resulting n single-trial
estimates. Then var(average) = (1/n) var(each).* However, V = var + (bias)^2.**
Therefore the reduction in V for a multi-trial estimate found in this way
depends on the change, if any, in bias as well as in variance.

* If m debuggers are used, the n = m(m-1)/2 possible data values are not
statistically independent and the variance relationship is not exactly true.
However, to avoid complications assume the n values are approximately
independent, or consider that n truly independent tests are made with 2n
debuggers.

** See Appendix 4.

Alternatively, we can average the n values of c, getting
$\bar c = \frac{1}{n}\sum_{i=1}^{n} c_i$, and replace c by $\bar c$ in any
single-trial estimate formula. Although
$V(\bar c) = \mathrm{var}(\bar c) = \frac{1}{n}\,\mathrm{var}(c)$, the effect
of the replacement is not obvious and must be examined anew for each formula.

Taking another point of view, we can trade reduced variance for a
quicker estimate by reducing s and t. Since V varies more or less
inversely with the product st (Eq. (3a)), having tagged and sampled sets of
size $t/\sqrt{n}$ and $s/\sqrt{n}$ respectively for an n-trial estimate will
keep the variance of the n-trial estimate approximately equal to that of the
corresponding single-trial estimate using sets of size s and t. Smaller
samples mean less time. The time saving may be more than proportional to the
reduction in s and t since errors probably become progressively harder to
find as the total number remaining decreases. That is, it takes longer for
one debugger to find 50 errors than for 2 to find 25 each, starting with the
program in the same state. However, choosing time-saving in preference to
reduced variance would, if the estimate is biased, increase the bias, which
also varies as 1/st (Eq. (1a)).

It might also be borne in mind in contemplating multi-trial estimating
that the multiple debugging is not all wasted; each debugger added to the
process finds errors others do not find, thereby contributing to the
necessary over-all debugging of the program.

6.2 Averaging Single-trial Estimates: $\bar N_0$ and $\bar N_1$

Let the estimate $\bar N_0$ be the average of n maximum-likelihood estimates
$N_{0i}$ associated with n independent experimental values $c_i$,
i = 1, 2, ..., n:

$$\bar N_0 = \frac{1}{n}\sum_{i=1}^{n} N_{0i} = \frac{1}{n}\sum_{i=1}^{n}\frac{st}{c_i}$$

Although the variance would be reduced by the expected factor of 1/n, the
bias would remain unchanged: $E(\bar N_0) = E(N_0)$, so V would be reduced by
less than the variance.

$$\begin{aligned}
V(\bar N_0) &= \mathrm{var}(\bar N_0) + b^2 \\
            &= \frac{1}{n}\,\mathrm{var}(N_0) + b^2 \\
            &= \frac{1}{n}\left[V(N_0) - b^2\right] + b^2 \\
            &= \frac{1}{n}\left[V(N_0) + (n-1)\,b^2\right] \\
            &= \frac{1}{n}\,V(N_0) + \frac{n-1}{n}\,b^2
\end{aligned}$$
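
The effect of a fixed bias on the averaged estimate can be illustrated numerically. The sketch below only evaluates the relation V = var + (bias)^2 before and after averaging; the single-trial figures are hypothetical and are not taken from the report's tables.

```python
def mse_of_averaged_estimate(v_single, bias, n):
    """Mean-squared error of the average of n independent single-trial estimates."""
    var_single = v_single - bias ** 2      # V = var + bias^2
    return var_single / n + bias ** 2      # variance shrinks, bias does not

# Hypothetical single-trial figures, chosen only for illustration.
v_single, bias = 400.0, 10.0
for n in (1, 2, 4, 8):
    averaged = mse_of_averaged_estimate(v_single, bias, n)
    print(f"n={n}  V(average)={averaged:7.2f}  V(single)/n={v_single / n:7.2f}")
```

With these numbers the averaged estimate never drops below the squared bias of 100, however many trials are combined, whereas V(single)/n keeps falling.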

If the bias is large, therefore, $V(\bar N_0)$ may be considerably greater
than $\frac{1}{n}V(N_0)$. In view of the almost zero bias of $N_1$, for
st/N > 3 and N not too large, it would make more sense to average $N_1$:

$$\bar N_1 = \frac{1}{n}\sum_{i=1}^{n} N_{1i}$$

Here,

$$E(\bar N_1) = E(N_1) = N, \qquad V(\bar N_1) = \frac{1}{n}\,V(N_1)$$


$$\bar{\bar N}_0 = \frac{st}{\bar c} \quad \text{where} \quad \bar c = \frac{1}{n}\sum_{i=1}^{n} c_i$$

Since a single s appears in the formula for $\bar{\bar N}_0$, all samples
must have the same size. This implies that s = t; $\bar{\bar N}_0$ can
therefore be written $s^2/\bar c$.

The bias and variance of $\bar N_0$ and $\bar N_1$ were found directly from
the bias and variance of $N_0$ and $N_1$, the original computation of which
was based on certain expansions of the reciprocal of a random variable with
hypergeometric distribution. The same method cannot apply to
$\bar{\bar N}_0$ because $\bar c$ is not hypergeometric.

However, we can use the Taylor's series method mentioned in Section
4.3. The results (see Appendix 2) are

$$E(\bar{\bar N}_0) = N\left[1 + \frac{1}{n}\left(\frac{N}{st}\right) + \frac{3}{n^2}\left(\frac{N}{st}\right)^2\right] \qquad (6)$$

$$V(\bar{\bar N}_0) = N^2\left[\frac{1}{n}\left(\frac{N}{st}\right) + \frac{9}{n^2}\left(\frac{N}{st}\right)^2\right] \qquad (7)$$

Setting n = 1 reduces $\bar{\bar N}_0$ to $N_0$. Because of the factors 1/n
and $1/n^2$, $E(\bar{\bar N}_0) < E(\bar N_0)$, which equals $E(N_0)$.
Similarly, because of the $1/n^2$ factor,
$V(\bar{\bar N}_0) < \frac{1}{n}V(N_0) < V(\bar N_0)$. The conclusion is that
$\bar{\bar N}_0$ is a better estimate than $\bar N_0$, as the example in
Table 3 shows.

Finally we mention the estimate having the form of $N_1$ but with $\bar c$
replacing c. Its bias and variance can be derived in the same way as those
of $\bar{\bar N}_0$. Mean and mean-squared error formulas for the estimates
considered are collected in Table 4.
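
The Taylor's series results can be evaluated directly. The sketch below assumes that Eqs. (6) and (7) read $E = N[1 + a/n + 3a^2/n^2]$ and $V = N^2[a/n + 9a^2/n^2]$ with a = N/st; the coefficients are a reading of a damaged original and should be checked against Appendix 2. The function names are illustrative.

```python
def taylor_mean(N, s, t, n=1):
    """Approximate mean of st / c_bar under the assumed form of Eq. (6)."""
    a = N / (s * t)
    return N * (1 + a / n + 3 * a ** 2 / n ** 2)

def taylor_mse(N, s, t, n=1):
    """Approximate mean-squared error of st / c_bar, assumed form of Eq. (7)."""
    a = N / (s * t)
    return N ** 2 * (a / n + 9 * a ** 2 / n ** 2)

N, s, t = 100, 25, 25
for n in (1, 3, 6):
    print(f"n={n}  E={taylor_mean(N, s, t, n):7.2f}  "
          f"V={taylor_mse(N, s, t, n):8.2f}  V(n=1)/n={taylor_mse(N, s, t) / n:8.2f}")
```

For every n > 1 the printed V lies below V(n=1)/n, which is the inequality used in the text to prefer averaging the data over averaging the estimates.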


Table 3. Comparison of mean and dispersion of single- and
multi-test estimates for one example.


7.0 Confidence Intervals

Some measure of the dispersion of an estimate is necessary to provide
information on the range within which, given the outcome of any particular
trial, the true value of the quantity sought may be expected to lie. If the
estimate is biased and has a large variance we have no great faith that the
true value is close to the estimated value. Inserting the variance of the
estimate in Chebyshev's inequality affords us one way of quantifying the
spread. Another is available if we know the distribution of the estimate.

Still another method, useful when the distribution and variance of the
estimate are not known, involves the calculation of confidence intervals
based only on the known distribution of data values, and on the particular
value found experimentally.

Confidence limits $a_1$ and $a_2$ are two random functions of the estimate
under study and of an arbitrary non-negative constant ε < 1 for which we
make the following claim: If the true value of the quantity being estimated
is in the interval [$a_1$, $a_2$], then the estimate actually computed, or
else some value closer to the true value, would occur in (1-ε)100% of trials
made.

Each estimate is a function of one or more data values. Consequently
any function of an estimate can be expressed as a function of the data
variable c, or the set {$c_i$} in a multi-trial procedure. The probability
that a particular value of an estimate will occur is identical with the
probability that the data points giving rise to that value will occur. Since
the calculation of the confidence limits depends on the distribution of the
data variable, the limits for all estimates depending on a single value are
identical, and can be found by means of Eq. (9) in Section 7.1.

7.1 Confidence Limits for the Estimates $N_0$ and $N_1$

For a 100ε% confidence level our first requirement is to find an
interval enclosing a set of data values occurring in (1-ε)100% of trials. One
way to do this is to replace the hypergeometric distribution which describes
the probability of c by its normal approximation (mean μ = st/N, variance
$\sigma^2 = st(N-s)(N-t)/N^3$). This done, we determine λ such that
(1-ε)100% of all occurrences of c will be within a distance of λσ from the
mean. λ is tabulated directly in [6] p. 558, or can be found using a table
of error functions. An interval on the c-axis satisfying the stated
condition is described by

$$P\{\mu - \lambda\sigma < c < \mu + \lambda\sigma\} \ge 1 - \varepsilon \qquad (8)$$

Our objective is to find the two values of N for which the value of c found
experimentally is at the ends of the allowed interval (see Figure 4). The
left inequality should provide the largest mean μ = st/N for which c is
still in the interval, and therefore the smallest N, i.e., the lower
confidence limit. We might replace c by $st/N_0$ or by
$\frac{(s+1)(t+1)}{N_1+1} - 1$, depending on the estimate of interest;
either $N_0$ or $N_1$ computed from c would then be the fixed quantity
rather than c. The procedure and results would be identical, as noted
previously, since we are still governed by the assumed normal distribution
of c as expressed in (8) rather than by the distribution of $N_0$ or $N_1$,
neither of which is known.

If σ were constant we could immediately solve the two inequalities for
N, thereby finding the confidence limits with no difficulty. Since σ is,
instead, a function of N, the procedure is not quite so straightforward.
Details appear in Appendix 3, in which it is shown that the confidence limits
at the 100ε% level are the 2 largest roots of g(N) = 0, where

$$g(N) = N^3 - \frac{st}{c}\left(2 + \frac{\lambda^2}{c}\right)N^2 + \frac{st}{c^2}\left[st + \lambda^2(s+t)\right]N - \left(\frac{st}{c}\,\lambda\right)^2 \qquad (9)$$

and λ depends on the preselected ε.

FIG. 4. CONFIDENCE INTERVAL.
Examples:

s = t = 25

1) c = 4

   $N_0 = st/c = 156$

   $N_1 = \frac{(s+1)(t+1)}{c+1} - 1 = 134$

   10% confidence level: ε = 0.1, λ = 1.6449

   $g(N) = N^3 - 418N^2 + 29{,}699N - 66{,}064$

   Confidence interval = [88, 328]

   50% confidence level: ε = 0.5, λ = 0.6745

   $g(N) = N^3 - 330N^2 + 25{,}303N - 11{,}107$

   Confidence interval = [121, 209]

2) c = 7

   $N_0 = 89$

   $N_1 = 84$

   10% confidence level:

   $g(N) = N^3 - 213N^2 + 9701N - 21{,}575$

   Confidence interval = [62, 148]

Confidence intervals are not unique. We can obtain limits more
symmetrically placed about the estimate by choosing different values for the
left- and right-hand occurrences of λ in (8) and modifying the procedure
accordingly. In any event, the results are approximate since they are based
on the normal approximation to the distribution of c.

7.2 Confidence Limits for the Estimates $\bar{\bar N}_0$ and $\bar{\bar N}_1$

The random variable in the expressions for $\bar{\bar N}_0$ and
$\bar{\bar N}_1$ is $\bar c = \frac{1}{n}\sum_i c_i$. $\bar c$ is an
asymptotically normal random variable (central limit theorem) with mean
equal to the mean of each individual c, or st/N; and variance equal to
1/n times the variance of each:

$$\bar\sigma^2 = \mathrm{var}(\bar c) = \frac{1}{n}\,\frac{st}{N}\,\frac{(N-s)(N-t)}{N(N-1)}$$

Determining λ as before leads us to another version of g(N):

$$g(N) = N^3 - \left[1 + \frac{st}{\bar c}\left(2 + \frac{\lambda^2}{n\bar c}\right)\right]N^2 + \frac{st}{\bar c}\left[2 + \frac{st}{\bar c} + \frac{\lambda^2(s+t)}{n\bar c}\right]N - \left(\frac{st}{\bar c}\right)^2\left(1 + \frac{\lambda^2}{n}\right) \qquad (10)$$

Example 3

s = t = 25

$c_i$ = 4, 11, 6; $\bar c$ = 7

$\bar{\bar N}_0$ = 89

For ε = 0.1,

$g(N) = N^3 - 192N^2 + 8727N - 15{,}165$

10% confidence limits are 70 and 120.

The width of the interval is considerably less for $\bar{\bar N}_0$ than for
$N_0$ computed with c = 7, as anticipated from its smaller mean-squared
error. In fact, except for the effect of bias, the two widths should be
proportional to the respective standard deviations, the ratio being
$1/\sqrt{n}$. In this case it is almost exactly that, the bias apparently
playing a small role.

If, instead of using the central limit theorem, we use the normal
approximation for each $c_i$ we have a slightly different variance,
$\bar\sigma^2 = \frac{st(N-s)(N-t)}{nN^3}$ (see [1]), leading to slightly
different results:

$$g(N) = N^3 - \frac{st}{\bar c}\left(2 + \frac{\lambda^2}{n\bar c}\right)N^2 + \frac{st}{\bar c^2}\left[st + \frac{\lambda^2}{n}(s+t)\right]N - \frac{\lambda^2 s^2 t^2}{n\,\bar c^2} \qquad (10a)$$

The same example now gives limits of 71 and 118, almost identical with the
above.
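
The confidence limits of Sections 7.1 and 7.2 can be computed numerically. The sketch below is not the report's procedure in detail: it assumes the cubic has the form $g(N) = N^3 - a_2N^2 + a_1N - a_0$ with $a_2 = \frac{st}{\bar c}(2 + \frac{\lambda^2}{n\bar c})$, $a_1 = \frac{st}{\bar c^2}[st + \frac{\lambda^2}{n}(s+t)]$ and $a_0 = \frac{\lambda^2 s^2t^2}{n\bar c^2}$, which is the normal-approximation variant (10a) and reduces to Eq. (9) for n = 1. λ is obtained from the standard normal quantile function instead of a table, and the roots are located by a simple sign-change scan followed by bisection. All function names are illustrative.

```python
from statistics import NormalDist

def lam_for(eps):
    """Two-sided normal multiplier: (1 - eps) of the mass lies within lam*sigma."""
    return NormalDist().inv_cdf(1 - eps / 2)

def g_coefficients(s, t, c_bar, lam, n=1):
    """Coefficients (a2, a1, a0) of g(N) = N^3 - a2 N^2 + a1 N - a0."""
    r = s * t / c_bar
    a2 = r * (2 + lam ** 2 / (n * c_bar))
    a1 = (r / c_bar) * (s * t + lam ** 2 * (s + t) / n)
    a0 = (r * lam) ** 2 / n
    return a2, a1, a0

def confidence_limits(s, t, c_bar, eps, n=1):
    """Two largest real roots of g(N) = 0 (lower and upper confidence limit)."""
    a2, a1, a0 = g_coefficients(s, t, c_bar, lam_for(eps), n)

    def g(N):
        return ((N - a2) * N + a1) * N - a0

    roots = []
    upper = int(1 + max(a2, a1, a0))       # Cauchy bound on root magnitudes
    for k in range(upper):
        lo, hi = float(k), float(k + 1)
        if g(lo) == 0.0:
            roots.append(lo)
        elif g(lo) * g(hi) < 0:
            for _ in range(60):            # bisection inside [k, k+1]
                mid = (lo + hi) / 2
                if g(lo) * g(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append((lo + hi) / 2)
    if len(roots) < 2:
        raise ValueError("g(N) has fewer than two real roots in the scanned range")
    return roots[-2], roots[-1]

print(confidence_limits(25, 25, 4, 0.1))        # single trial, c = 4
print(confidence_limits(25, 25, 7, 0.1, n=3))   # three trials, c_bar = 7
```

The computed limits agree with the report's rounded figures to within about one unit; small differences are to be expected because the report's coefficients were rounded by hand.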


8.0 Other Models - Assumption of Variable Intrinsic Difficulty

All the estimates which have been examined were based on the assumption
of purely random choice: all errors were, one might say, laid out before
any debugger who had but to close his eyes and choose. The discussion
on equal probability in Section 2 noted several varieties of challenge which
might be launched against that hypothesis. In this section we attempt to
describe models providing for variable intrinsic difficulty.

8.1 Model 2 - Variable Difficulty, Program Distribution Unknown

Make the following assumptions:

1) All bugs can be assigned at sight to categories based on difficulty
of discovery.

2) Within each category, errors are undifferentiated -- subject to
random discovery with equal probability.

Suppose there are k difficulty categories. Tag (or seed) and sample
as before. By virtue of assumption 2, $c_i/s_i \approx t_i/N_i$ where $t_i$,
$s_i$, $c_i$ are the tagged, sampled and common bugs respectively in the
i-th category; some may be 0. $N_i$ is the unknown number of program bugs in
the i-th category. In principle, one may apply all Model 1 information to
each category separately, deriving category estimates using any estimator
previously discussed. For example, using $N_0$ for simplicity, we have
category estimates $N_{0i} = s_i t_i / c_i$, i = 1, ..., k, which can be
found whenever $s_i$ and $t_i$ are nonzero.* Since the more difficult
categories will probably be empty at first, we will not in general have an
estimate of the total population. We can, in theory at least, continue to
test until enough errors in all categories are available. However, a
possibly more efficient way is to estimate whatever categories appear in
sufficient numbers after a brief test to make the estimate reliable;
continue debugging without testing* (i.e., with one debugger) but keeping
count of the number of bugs found in each category which has not yet been
estimated; and finally conduct similar tests to estimate the missing
categories when their appearance is frequent enough, adding the pre-estimate
count in each case to get a number comparable with the initial estimates. If
the reason for making the original estimate is to gauge reliability at the
end of a finite debugging process, an error count -- though not by
category -- would be required in any event in order to estimate how many
remain still undiscovered.

* We retain the convention that $N_{0i} = 2 s_i t_i$ when $c_i$ = 0.

As an example consider Table 5 where 3 categories of difficulty are
assumed. The true figures in the total column are of course unknown. Two
sets of experimental values for $c_i$ are shown. In a) the $c_i$ were chosen
at their expected values for the true data, $E(c_i) = s_i t_i / N_i$; the
category estimates are therefore exactly right. In b) the values are not
ideal. The $r_i$ column and the remaining estimates will be defined in
connection with Model 3.

A major difficulty is that numbers may be small; getting large enough
samples within each category for low bias and variance may require
extensive testing.

8.2 Model 3 - Variable Difficulty, Program Distribution Known

A third assumption makes it possible to complete the estimate with one
trial, from incomplete category estimates:

3) The distribution ratio of program errors by category is known.

* This is suggested in order to avoid the cost of continued duplicate
debugging.


Category  Type     Tagged  Sampled  Total   r_i    Common, c_i
   i                t_i      s_i     N_i           a) Ideal  b) Non-ideal

   1      easy      400      480     1200    .6      160        150
   2      medium     60      100      600    .3       10         12
   3      hard                 0      200    .1

                       a) Ideal                      b) Non-ideal

Model 2
                N_01 = 480 x 400 / 160 = 1200    N_01 = 480 x 400 / 150 = 1280
                N_02 = 100 x 60 / 10 = 600       N_02 = 100 x 60 / 12 = 500
                N_03: no estimate                N_03: no estimate

Model 3 - first procedure
                N_0(r_1) = 1200 / .6 = 2000      N_0(r_1) = 1280 / .6 = 2133
                N_0(r_2) = 600 / .3 = 2000       N_0(r_2) = 500 / .3 = 1667
                Average N = 2000                 Average N = 1900

Table 5. Example with errors differentiated by
difficulty (Models 2 and 3).


8.2.1 First Estimating Procedure - for Tagging or Seeding

The new assumption provides us with the ratio $r_i = N_i/N$ for all i.

Using any one of the category estimates of Section 8.1 we find
$N_0 = N_{0i}/r_i$.

In fact we have as many estimates of N as we have category estimates.

Table 5 contains an example of this estimating procedure.
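
The first procedure can be sketched in a few lines. The data below are the Table 5 values as far as they can be read from this copy (sampled 480, 100, 0; tagged 400, 60; common 160, 10 in the ideal case and 150, 12 in the non-ideal case; r = .6, .3, .1). The tagged count for the "hard" category is not legible and is set to 0 here purely as a placeholder; it does not affect the result because nothing was sampled in that category. Function names are illustrative.

```python
def category_estimates(sampled, tagged, common):
    """N_0i = s_i t_i / c_i per category; None where no estimate is possible."""
    estimates = []
    for s_i, t_i, c_i in zip(sampled, tagged, common):
        if s_i == 0 or t_i == 0:
            estimates.append(None)              # empty category: no estimate
        elif c_i == 0:
            estimates.append(2 * s_i * t_i)     # convention from the Section 8.1 footnote
        else:
            estimates.append(s_i * t_i / c_i)
    return estimates

def total_estimates(cat_estimates, ratios):
    """One estimate of N per usable category: N_0 = N_0i / r_i."""
    return [e / r for e, r in zip(cat_estimates, ratios) if e is not None]

sampled = [480, 100, 0]
tagged = [400, 60, 0]          # third entry is a placeholder (illegible in source)
ratios = [0.6, 0.3, 0.1]

for label, common in (("ideal", [160, 10, 0]), ("non-ideal", [150, 12, 0])):
    totals = total_estimates(category_estimates(sampled, tagged, common), ratios)
    print(label, [round(x) for x in totals], round(sum(totals) / len(totals)))
```

Averaging the per-category totals, as in the last line of the table, is one simple way of combining them; the report does not state that it is the best one.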

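$N_0$ has the same ratio of bias and standard deviation to mean as $N_{0i}$:

$$E(N_0) = \frac{1}{r_i}\,E(N_{0i})$$

$$\mathrm{var}(N_0) = \frac{1}{r_i^2}\,\mathrm{var}(N_{0i}), \qquad \sigma(N_0) = \frac{1}{r_i}\,\sigma(N_{0i})$$

$$V(N_0) = \frac{1}{r_i^2}\,V(N_{0i}), \qquad \sigma_e(N_0) = \frac{1}{r_i}\,\sigma_e(N_{0i})$$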

8.2.2 Second Estimating Procedure - For Seeding Only

With the third assumption we can also use the seeding variant to
estimate N directly without finding category estimates. Let the program have
$E = E_1 + \cdots + E_k$ errors, $E_i$ representing the number in the i-th
category. Construct and insert a matching set of $t = t_1 + \cdots + t_k$
errors, $t_i/t = E_i/E$, $t_i \ne 0$. The total number of errors after
seeding is $N = N_1 + \cdots + N_k$ where

$$N_i = E_i + t_i, \qquad \frac{N_i}{N} = \frac{E_i}{E} = \frac{t_i}{t}$$

The debugger finds $s = s_1 + \cdots + s_k$ bugs, where $s_i$ may be 0.

If the seeding approach is used, the ratio $r_i$ is computed with the seeded
bugs included in $N_i$. The seeded bugs need not be distributed among the
categories in the same proportion as the original program errors.


Since the seeded bugs are assumed indistinguishable from the original, we
can again reasonably expect that

$$\frac{c_i}{s_i} \approx \frac{t_i}{N_i}$$

where now $c_i$ and $s_i$ may jointly be 0, but $t_i \ne 0$.

The total number of seeded bugs uncovered is

$$c = \sum_i c_i \approx \sum_i s_i\,\frac{t_i}{N_i} = \sum_i s_i\,\frac{t}{N} = \frac{st}{N}$$

We are therefore led to the same ad hoc estimate as for the equal
probability case, $\hat N = st/c$.

The $c_i$ are hypergeometric by virtue of assumption 2, but the
distribution of their sum c is unknown, although asymptotically normal. The
Taylor's series method (Appendix 2) with normal approximation for $c_i$ (or
only for c if k is large) can be used to find the mean and mean-squared
error of the estimate st/c. Mean and variance of c are the sum of the means
and variances of $c_i$:

$$E(c_i) = \frac{s_i t_i}{N_i}, \qquad E(c) = \sum_{i=1}^{k}\frac{s_i t_i}{N_i}$$

$$\mathrm{var}(c_i) = \frac{s_i t_i\,(N_i - s_i)(N_i - t_i)}{N_i^3}, \qquad \mathrm{var}(c) = \sum_{i=1}^{k}\mathrm{var}(c_i)$$

Higher moments are found from the normal approximation:

$$\mu_3(c) = 0, \qquad \mu_4(c) = 3\,[\mathrm{var}(c)]^2$$

Table 6 shows an example of this approach with ideal and non-ideal
experimental values. (Both are ideal in the sense that the seeded set
matches the true distribution exactly.)
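
The proportional-seeding estimate pools all categories before dividing, which a short sketch makes explicit. The figures are those of Table 6 (seeded 60, 30, 10; sampled 480, 100, 0; totals 1200, 600, 200; common 24, 5, 0 in the ideal case and 20, 6, 0 in the non-ideal case); the function names are illustrative.

```python
seeded = [60, 30, 10]         # t_i, proportional to the assumed true distribution
sampled = [480, 100, 0]       # s_i
totals = [1200, 600, 200]     # N_i after seeding (unknown in practice)

def pooled_estimate(sampled, seeded, common):
    """Ad hoc estimate st/c using only the category totals s, t and c."""
    return sum(sampled) * sum(seeded) / sum(common)

def expected_common(sampled, seeded, totals):
    """E(c_i) = s_i t_i / N_i for each category."""
    return [s_i * t_i / n_i for s_i, t_i, n_i in zip(sampled, seeded, totals)]

print(expected_common(sampled, seeded, totals))               # the ideal c_i
print(round(pooled_estimate(sampled, seeded, [24, 5, 0])))    # ideal data
print(round(pooled_estimate(sampled, seeded, [20, 6, 0])))    # non-ideal data
```

Note that the "hard" category contributes to s and t only through the seeded count; no bug of that type has to be found for the pooled estimate to exist.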


                 Seeded  Sampled  Total      Common, c_i
 i    Type        t_i      s_i     N_i    a) Ideal  b) Non-ideal

 1    easy         60      480     1200      24         20
 2    medium       30      100      600       5          6
 3    hard         10        0      200       0          0

      total       100      580     2000      29         26

a) N = st/c = 580 x 100 / 29 = 2000

b) N = st/c = 580 x 100 / 26 = 2231

Table 6. Proportional seeding example with errors
differentiated by difficulty (Model 3).


9.0 Conclusions


The modified maximum likelihood estimate considered under the
equal-probability assumption is the estimate of choice in a single-trial test
if the total number of errors exceeds about 50; its bias is practically zero
and its variance reasonable. Its variance was found, furthermore, to vary
in predictable ways with various ratios among error population, sample
size and size of tagged or seeded set. As a consequence it is possible to
design a seeding/tagging test optimally for the desired precision. Graphs
make the choice of s and t a simple procedure. Estimates of larger values
can be made in relatively less time.

For N < 50 a decision must be made between $N_1$, and $N_0$ with its
higher bias but lower mean-squared error.

Multi-trial procedures can decrease the dispersion still further. Of
the two types considered, the better one replaces the random variable c by
its average over the several trials.

A brief treatment of estimates for models other than equal probability
indicates that the estimates are closely related to those for equal
probability.

In summary, estimates of adequate accuracy and precision are available.
The viability of seeding/tagging reliability tests rests on the answers
to the practical questions which can be raised.

Appendix 1. Derivation of Second Order Approximations for $E(N_0)$, $V(N_0)$
and $V(N_1)$.

A. Derivation of Eq. (1)

The exact expression for $E(N_0)$ derived by Chapman consists essentially
of the first m terms, m arbitrary, of an infinite series plus a
remainder term. It can be written in the following form:


E(N  ) = st  { (1  -K)  [a  + — a,  + 2 — or,  + 6 — a +.  . . + (m-1 ) ! — ] 
' o L 1 r^  2 r 3 r 4 ri  m 


+ — — [l a,  - 2 — a , - . . . - (m-1  ) ! a 

ij  1 r(  2 r 3 ri  m 


+ (l-Po)E(Rm|  s*0)  } 


where

$$K = \begin{cases} \dfrac{N-s-t}{N+1}\,P_0 & \text{for } s+t \le N \\[1ex] 0 & \text{otherwise} \end{cases}$$

$$\alpha_1 = \frac{N+1}{(s+1)(t+1)}, \qquad \alpha_i = \frac{N+i}{(s+i)(t+i)}\,\alpha_{i-1} \quad \text{for } i \ge 2$$

$$r_i = 1 - \sum_{c=0}^{i-1} P(c \mid N+i,\ s+i,\ t+i)$$

$R_m$ = remainder term

To derive his first order approximation Chapman assumes that for st/N > 10,
the following approximations hold:

$$P_0 = 0, \quad K = 0, \quad R_m = 0, \quad \text{all } r_i = 1, \quad \text{all } \alpha_i = \left(\frac{N}{st}\right)^i$$

The result is

$$E(N_0) \approx N\left[1 + \frac{N}{st} + 2\left(\frac{N}{st}\right)^2\right]$$

from which it appears that the bias depends only on the ratio N/st, or st/N.

The critical assumption is $\alpha_i = (N/st)^i$, which is close for N, s
and t very large but otherwise introduces considerable positive error,
increasing as N gets smaller.

The remaining assumptions seem justified:

1. Since the series converges, $R_m \to 0$.

2. $P_0$ is generally small except for N very large and st/N very small
simultaneously. st/N small means that the mean of the distribution of c is
close to the origin, and N large means that for a given value of st/N the
variance is large since $\frac{(N-s)(N-t)}{N(N-1)} \to 1$. Large variance
implies relatively high probability at c = 0.
Some order-of-magnitude values are:


27, 000 


100,  00C 


We can therefore assume, except where signalled by small st/N, that $P_0$ = 0.

3. It follows that K, which is less than $P_0$, is approximately 0 and
1 - K = 1.

4. For i small the probabilities in the expressions for $r_i$ are in the
tail of the distribution, and are small for reasonable st/N (see Figure 1).
Therefore $r_i$ = 1 and $r_i/r_1$ = 1.

5. In most cases, the second term is very much less than the first
term and can safely be ignored. The first term is dominated by $\alpha_1$ for
reasonable values of st/N, as the definition of $\alpha_i$ shows. The ratio of the

1 o S t hi 

second  term  to  the  first  is  about  — j . For  -yr  exceedingly  large  = — is 

Po  °1  -6  . stS 

very  small  and  — =-  can  be  significant.  However,  if  P =10  and  -rr=  10, 
Of.C.  o 

po  -4  1 

— — = 10  . That  is,  we  make  an  error  of  about  . 01%  in  E(N  ) by  neglecting 

a\  L o' 

the  second  term. 

If we make only the assumptions discussed above, omitting the $\alpha_i$
assumption, we are left with

$$E(N_0) = st\left[\alpha_1 + \alpha_2 + 2\alpha_3 + \cdots + (m-1)!\,\alpha_m\right] \qquad (1)$$


B. Derivation of Equation (1a)

Let $\alpha_0 = N/st$. Define $k_i$ for i = 1, 2, ... by
$\alpha_i = k_i\,\alpha_0^i$. Rewrite Eq. (1) as

$$E(N_0) = N\left[k_1 + k_2\,\frac{N}{st} + 2k_3\left(\frac{N}{st}\right)^2 + \cdots + (m-1)!\,k_m\left(\frac{N}{st}\right)^{m-1}\right] \qquad (1a)$$

It remains to be shown that

$$k_1 = \frac{1 + 1/N}{(1 + 1/s)(1 + 1/t)}$$

which is done by factoring $\alpha_0 = N/st$ from $\alpha_1$; and that

$$k_i = k_{i-1}\,\frac{1 + i/N}{(1 + i/s)(1 + i/t)}, \qquad i = 2, 3, \ldots$$

The form holds for i = 2:

$$\alpha_2 = \frac{N+2}{(s+2)(t+2)}\,\alpha_1 = \frac{N}{st}\,\frac{1 + 2/N}{(1 + 2/s)(1 + 2/t)}\,k_1\alpha_0 = \alpha_0^2\,k_1\,\frac{1 + 2/N}{(1 + 2/s)(1 + 2/t)}$$

$$\therefore\quad k_2 = k_1\,\frac{1 + 2/N}{(1 + 2/s)(1 + 2/t)}$$

and the general form follows readily by induction.
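
The equivalence of the two forms can be checked numerically. The sketch below assumes the recursions $\alpha_1 = (N+1)/((s+1)(t+1))$, $\alpha_i = \alpha_{i-1}(N+i)/((s+i)(t+i))$ and the corresponding recursion for $k_i$, and compares Eq. (1), read as $st\sum (i-1)!\,\alpha_i$, with Eq. (1a), read as $N\sum (i-1)!\,k_i (N/st)^{i-1}$. The parameter values and function names are illustrative.

```python
from math import factorial, isclose

def alphas(N, s, t, m):
    """alpha_1 .. alpha_m by the recursion of part A (index 0 unused)."""
    a = [None, (N + 1) / ((s + 1) * (t + 1))]
    for i in range(2, m + 1):
        a.append(a[i - 1] * (N + i) / ((s + i) * (t + i)))
    return a

def ks(N, s, t, m):
    """k_1 .. k_m by the recursion of part B (index 0 unused)."""
    k = [None, (1 + 1 / N) / ((1 + 1 / s) * (1 + 1 / t))]
    for i in range(2, m + 1):
        k.append(k[i - 1] * (1 + i / N) / ((1 + i / s) * (1 + i / t)))
    return k

def mean_n0_eq1(N, s, t, m):
    a = alphas(N, s, t, m)
    return s * t * sum(factorial(i - 1) * a[i] for i in range(1, m + 1))

def mean_n0_eq1a(N, s, t, m):
    k = ks(N, s, t, m)
    a0 = N / (s * t)
    return N * sum(factorial(i - 1) * k[i] * a0 ** (i - 1) for i in range(1, m + 1))

N, s, t, m = 150, 50, 40, 6
print(mean_n0_eq1(N, s, t, m), mean_n0_eq1a(N, s, t, m))
```

The two printed values coincide up to rounding, as the induction argument requires.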


C. Derivation of Equations (3) and (3a)

The exact expression for $V(N_o)$ has a structure similar to that for $E(N_o)$, containing the expressions $\alpha_i$, $r_i/r_1$, $(1-K)$, $P_o$ and $R'_m$ (a remainder term somewhat different from $R_m$). Chapman's approximations lead to


$$V(N_o) = N^2\left[\frac{N}{st} + 2\left(\frac{N}{st}\right)^2 + 6\left(\frac{N}{st}\right)^3\right]$$

In accordance with the preceding discussion, we permit all but one approximation to stand to arrive at Eq. (3) and, with the same transformation as before, at Equation (3a).

D. $V(N_1)$

An exact form for $V(N_1)$, derived using Chapman's method, is
fs  + lHt+l  WN+lKN+2)  „ „Tr21_  3 _ _N  + 3 x ,_j  ^Nt4)(N+3j + 

v (Nj  > = (stZ)(t+2)  l K '-r1  r1(s+3)(t+3)+2r1  ( s +4 )( t+ 4 ) ( s + 3 ) (t+ 3 ) 

+ (m-2)  ! ^ff^Sl:;{rtl)]  + (8+1  )2(t+1  )2(l  -K)E(Rm'  ')‘(N+1 


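The same approximations as before reduce this to

$$V(N_1) = \frac{(s+1)(t+1)(N+1)(N+2)}{(s+2)(t+2)}\left[1 + \frac{N+3}{(s+3)(t+3)} + 2\,\frac{(N+4)(N+3)}{(s+4)(t+4)(s+3)(t+3)} + \dots + (m-2)!\,\frac{(N+m)\cdots(N+3)}{(s+m)\cdots(s+3)\,(t+m)\cdots(t+3)}\right] - (N+1)^2$$

$$V(N_1) = (s+1)^2(t+1)^2\,[\alpha_2 + \alpha_3 + 2\alpha_4 + \dots + (m-2)!\,\alpha_m] - (N+1)^2 \tag{5}$$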


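which can also be written as

$$V(N_1) = (s+1)^2(t+1)^2\left(\frac{N}{st}\right)^2\left[(k_2 - k_1^2) + k_3\left(\frac{N}{st}\right) + 2k_4\left(\frac{N}{st}\right)^2 + \dots + (m-2)!\,k_m\left(\frac{N}{st}\right)^{m-2}\right] \tag{5a}$$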
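The equivalence of the $\alpha$ form and the $k$ form of $V(N_1)$ rests on $(N+1)^2 = (s+1)^2(t+1)^2\alpha_1^2$. A minimal sketch that checks it with exact rationals follows. The product form of $\alpha_i$ is an assumption inferred from the derivation of Eq. (1a), and all function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def alpha(i, N, s, t):
    """Assumed form: alpha_i = prod_{j=1..i} (N+j) / ((s+j)(t+j))."""
    value = Fraction(1)
    for j in range(1, i + 1):
        value *= Fraction(N + j, (s + j) * (t + j))
    return value

def k_coeffs(m, N, s, t):
    """k_1..k_m from k_i = k_{i-1} (1+i/N) / ((1+i/s)(1+i/t)), taking k_0 = 1."""
    ks, k = [], Fraction(1)
    for i in range(1, m + 1):
        k *= (1 + Fraction(i, N)) / ((1 + Fraction(i, s)) * (1 + Fraction(i, t)))
        ks.append(k)
    return ks

def var_N1_alpha_form(m, N, s, t):
    """(s+1)^2 (t+1)^2 [alpha_2 + alpha_3 + 2 alpha_4 + ... + (m-2)! alpha_m] - (N+1)^2."""
    series = sum(factorial(i - 2) * alpha(i, N, s, t) for i in range(2, m + 1))
    return (s + 1) ** 2 * (t + 1) ** 2 * series - (N + 1) ** 2

def var_N1_k_form(m, N, s, t):
    """Eq. (5a): same quantity written with k_i and powers of N/st."""
    a0 = Fraction(N, s * t)
    ks = k_coeffs(m, N, s, t)
    bracket = (ks[1] - ks[0] ** 2) + sum(
        factorial(i - 2) * ks[i - 1] * a0 ** (i - 2) for i in range(3, m + 1)
    )
    return (s + 1) ** 2 * (t + 1) ** 2 * a0 ** 2 * bracket
```

Both functions truncate the series at the same `m`, so they should agree exactly for any `m >= 2`.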


Appendix 2. Taylor's Series Derivation for $E(\hat N)$ and $V(\hat N)$

A. Let $\hat N$ be any estimate of $N$. $\hat N$ is a function of $c$, say $w(c)$. Let $v(c) = [w(c) - N]^2$. Then $E[w(c)] = \sum_c w(c)P(c)$ and $E[v(c)] = \sum_c v(c)P(c)$ are respectively the mean and the mean-squared error of estimate $\hat N$. Consequently any method for evaluating the expected value of a function of a random variable can be used to find both $E(\hat N)$ and $V(\hat N)$.

B. Let the mean of $c$ be $m$, its variance be $\sigma^2$ and its $k$th central moment, $E[(c-m)^k]$, be $\mu_k$. Let $g(c)$ be a function (we will later let it be both $w(c)$ and $v(c)$) which can be expanded in a Taylor's series about $m$:

$$g(c) = g(m) + (c-m)\,g'(m) + \frac{(c-m)^2}{2!}\,g''(m) + \frac{(c-m)^3}{3!}\,g'''(m) + \dots$$

Multiply each term by $P(c)$ and sum over all $c$. The result is

$$E[g(c)] = g + g'\,E(c-m) + \frac{g''}{2!}\,E[(c-m)^2] + \frac{g'''}{3!}\,E[(c-m)^3] + \dots$$

where $g$ and its derivatives are evaluated at $c = m$,

$$E(c-m) = 0, \qquad E[(c-m)^2] = \sigma^2, \qquad E[(c-m)^3] = \mu_3.$$

Then

$$E[g(c)] = g + \frac{\sigma^2}{2}\,g'' + \frac{\mu_3}{3!}\,g''' + \frac{\mu_4}{4!}\,g^{(4)} + \dots \tag{11}$$


If a truncated portion is to be a reasonable approximation of $E[g(c)]$, the series must converge, and rapidly. If $g(c)$ does not vary too much near $m$, the derivatives will be small. But the $(c-m)^k P(c)$ terms must not be too large; this requires that the domain of $c$ not spread too far from $m$ and/or that the remote points have very low probability.
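As an illustration of Eq. (11), the sketch below evaluates the series truncated after the fourth-derivative term for an arbitrary discrete distribution. The function names and the dictionary representation of $P(c)$ are assumptions made for the example. For a polynomial $g$ of degree four or less the Taylor series terminates, so the truncated form reproduces $E[g(c)]$ exactly, which gives a simple check.

```python
def central_moment(pmf, k):
    """k-th central moment of a discrete distribution given as {value: probability}."""
    mean = sum(c * p for c, p in pmf.items())
    return sum((c - mean) ** k * p for c, p in pmf.items())

def taylor_expectation(derivs_at_mean, pmf):
    """Eq. (11) truncated after the mu_4 term.

    derivs_at_mean = (g, g', g'', g''', g'''') evaluated at c = m.
    The g' term drops out because E(c - m) = 0.
    """
    g, _g1, g2, g3, g4 = derivs_at_mean
    return (
        g
        + central_moment(pmf, 2) / 2 * g2
        + central_moment(pmf, 3) / 6 * g3
        + central_moment(pmf, 4) / 24 * g4
    )
```

For a function such as $w(c) = st/c$ the series does not terminate, and the truncation error depends on how far the distribution of $c$ spreads from $m$, as noted above.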

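C. We find $E(\hat N)$ by replacing $g(c)$ by $\hat N = w(c)$:

$$E(\hat N) = w + \frac{\sigma^2}{2}\,w'' + \frac{\mu_3}{3!}\,w''' + \frac{\mu_4}{4!}\,w^{(4)} \tag{12}$$

where $w$, $w''$, $w'''$ and $w^{(4)}$ are evaluated at $c = m$.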

We find $V(\hat N)$ by replacing $g(c)$ and its derivatives by $v(c)$ and its derivatives. Evaluating at $m$ gives

$$v = (w-N)^2$$
$$v' = 2(w-N)\,w'$$
$$v'' = 2(w-N)\,w'' + 2(w')^2$$
$$v''' = 2(w-N)\,w''' + 6w'w''$$
$$v^{(4)} = 2(w-N)\,w^{(4)} + 8w'w''' + 6(w'')^2$$

Substituting these in Eq. (11) we find

$$V(\hat N) = (w-N)^2 + \sigma^2\left[(w')^2 + w''(w-N)\right] + \frac{\mu_3}{3}\left[3w'w'' + w'''(w-N)\right] + \frac{\mu_4}{12}\left[4w'w''' + 3(w'')^2 + w^{(4)}(w-N)\right]. \tag{13}$$

D. Application to Estimate $N_o$

The mean and variance of $c$ are known. The higher moments are not readily available but can be calculated from the characteristic function of the hypergeometric distribution or from the formula for skewness [2]. $\mu_3$ was in fact derived by the writer and substituted in Eqs. (12) and (13) to write expressions for $E(N_o)$ and $V(N_o)$. However, the results in specific cases were uniformly low, indicating the need for more terms. We can avoid deriving $\mu_4$ by using a normal approximation for $c$ [1], for which higher moments are more accessible.


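In that event

$$\sigma^2 = \frac{st}{N}\,\frac{(N-s)(N-t)}{N^2} = \frac{st}{N}\,q, \qquad \mu_3 = 0, \qquad \mu_4 = 3\sigma^4 = 3\left(\frac{st}{N}\right)^2 q^2$$

where $q = \dfrac{(N-s)(N-t)}{N^2}$.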


For $N_o$:

$$w(c) = st/c, \qquad w = w(m) = N$$
$$w'(c) = -st/c^2, \qquad w' = -N\,(N/st)$$
$$w''(c) = 2st/c^3, \qquad w'' = 2N\,(N/st)^2$$
$$w'''(c) = -6st/c^4, \qquad w''' = -6N\,(N/st)^3$$
$$w^{(4)}(c) = 4!\,st/c^5, \qquad w^{(4)} = 4!\,N\,(N/st)^4$$


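Substituting in Eqs. (12) and (13), we obtain, finally,

$$E(N_o) = N\left[1 + q\left(\frac{N}{st}\right) + 3q^2\left(\frac{N}{st}\right)^2\right]$$

$$V(N_o) = N^2\left[q\left(\frac{N}{st}\right) + 9q^2\left(\frac{N}{st}\right)^2\right]$$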


$\mu_3$ was found to be

$$\mu_3 = \frac{st\,(N-s)(N-t)(N-2s)(N-2t)}{N^3\,(N-1)(N-2)}$$

but $\mu_4$ does not follow the pattern of $\sigma^2$, $\mu_3$; it is probably the sum of such a term and another.
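The substitution into Eqs. (12) and (13) can be followed numerically. This is a minimal sketch under the normal approximation stated above ($\mu_3 = 0$, $\mu_4 = 3\sigma^4$). The function names are hypothetical, and the example values $N = 1000$, $s = t = 100$ are illustrative only.

```python
def normal_moments(N, s, t):
    """sigma^2, mu_3, mu_4 of c under the normal approximation."""
    q = (N - s) * (N - t) / N ** 2
    var = s * t / N * q
    return var, 0.0, 3 * var ** 2

def w_derivs(N, s, t):
    """w(c) = st/c and its first four derivatives, evaluated at c = m = st/N."""
    st, m = s * t, s * t / N
    return (st / m, -st / m ** 2, 2 * st / m ** 3, -6 * st / m ** 4, 24 * st / m ** 5)

def mean_eq12(w, var, mu3, mu4):
    """Eq. (12): truncated Taylor estimate of E(N_hat)."""
    return w[0] + var / 2 * w[2] + mu3 / 6 * w[3] + mu4 / 24 * w[4]

def mse_eq13(w, N, var, mu3, mu4):
    """Eq. (13): truncated Taylor estimate of V(N_hat)."""
    d = w[0] - N
    return (
        d ** 2
        + var * (w[1] ** 2 + w[2] * d)
        + mu3 / 3 * (3 * w[1] * w[2] + w[3] * d)
        + mu4 / 12 * (4 * w[1] * w[3] + 3 * w[2] ** 2 + w[4] * d)
    )

def closed_forms(N, s, t):
    """E(N_o) = N[1 + x + 3x^2] and V(N_o) = N^2[x + 9x^2], with x = q N/st."""
    x = (N - s) * (N - t) / N ** 2 * N / (s * t)
    return N * (1 + x + 3 * x ** 2), N ** 2 * (x + 9 * x ** 2)
```

With $N = 1000$ and $s = t = 100$, both routes give a mean of about 1100.7 and a mean-squared error of about 140049.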
E. Application to Estimate $\bar N_o$

The form of $\bar N_o = st/\bar c$, where $\bar c = \frac{1}{n}\sum_{i=1}^{n} c_i$, is identical with that of $N_o$. The only difference is that the random variable is $\bar c$ rather than $c$; the quantities $m$, $\sigma^2$, $\mu_k$ in Eqs. (12) and (13) must therefore be the mean and central moments of $\bar c$. As the average of $n$ random variables with identical distributions:

$$E(\bar c) = E(c) = \frac{st}{N}$$

$$\sigma^2(\bar c) = \frac{1}{n}\,\sigma^2(c)$$

Under the normal assumption for $c$, $\bar c$ is itself normal, and

$$\sigma^2(\bar c) = \frac{1}{n}\,\frac{st}{N}\,q$$

$$\mu_3(\bar c) = 0$$

$$\mu_4(\bar c) = 3\sigma^4(\bar c) = \frac{3}{n^2}\left(\frac{st}{N}\right)^2 q^2$$


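We need only replace $q$ by $q/n$ in Eqs. (2) and (4) to get the corresponding expressions for $\bar N_o$:

$$E(\bar N_o) = N\left[1 + \frac{q}{n}\left(\frac{N}{st}\right) + 3\,\frac{q^2}{n^2}\left(\frac{N}{st}\right)^2\right] \tag{6}$$

$$V(\bar N_o) = N^2\left[\frac{q}{n}\left(\frac{N}{st}\right) + 9\,\frac{q^2}{n^2}\left(\frac{N}{st}\right)^2\right] \tag{7}$$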
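A short sketch of Eqs. (6) and (7) shows how the bias and mean-squared error shrink as the number of samples $n$ grows. The function name and the example values are illustrative assumptions; setting `n=1` recovers the single-sample expressions.

```python
def pooled_mean_and_mse(N, s, t, n):
    """Eqs. (6) and (7): E and V of the estimate st/c_bar from n samples.

    Uses the normal approximation, with q = (N-s)(N-t)/N^2 replaced by q/n.
    """
    q = (N - s) * (N - t) / N ** 2
    x = (q / n) * (N / (s * t))
    mean = N * (1 + x + 3 * x ** 2)
    mse = N ** 2 * (x + 9 * x ** 2)
    return mean, mse
```

For $N = 1000$, $s = t = 100$, going from $n = 1$ to $n = 4$ reduces the approximate mean from about 1100.7 to about 1021.5.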
Appendix 3. Confidence Intervals

A. Confidence Limits for $N_o$ and $N_1$

Assume that $c$ has a normal rather than hypergeometric distribution. The mean of the normal approximation [1] remains $st/N$ and the variance is only slightly changed: $\sigma^2 = \frac{st}{N}\cdot\frac{(N-s)(N-t)}{N^2}$. For a $100(1-\epsilon)\%$ confidence level, let $\lambda\sigma$ be the half-width of a symmetrical interval about the mean containing $(1-\epsilon)100\%$ of all occurrences. Then

$$P\left\{\frac{st}{N} - \lambda\sigma \le c \le \frac{st}{N} + \lambda\sigma\right\} = 1 - \epsilon \tag{8}$$


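Known quantities are $s$, $t$, $\lambda$, $\epsilon$ and the experimentally determined $c$. The only unknown, when $\sigma$ is replaced by the square root of $\sigma^2$ as given above, is $N$. From the left-hand inequality, we get

$$\frac{st}{N} - \lambda\sqrt{\frac{st}{N}\,\frac{(N-s)(N-t)}{N^2}} \le c$$

$$\frac{st}{N}\,\frac{(N-s)(N-t)}{N^2} \ge \frac{1}{\lambda^2}\left[\left(\frac{st}{N}\right)^2 + c^2 - 2c\,\frac{st}{N}\right]$$

$$st\,(N-s)(N-t) \ge \frac{1}{\lambda^2}\left[(st)^2 N + c^2 N^3 - 2cst\,N^2\right]$$

$$\frac{c^2}{\lambda^2}\,N^3 - N^2\left(\frac{2cst}{\lambda^2} + st\right) + N\left[\frac{(st)^2}{\lambda^2} + s^2 t + st^2\right] - (st)^2 \le 0$$

$$g(N) \equiv N^3 - N^2\,\frac{st}{c}\left(2 + \frac{\lambda^2}{c}\right) + N\,\frac{st}{c^2}\left[st + \lambda^2(s+t)\right] - \left(\frac{\lambda st}{c}\right)^2 \le 0 \tag{9}$$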


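From the right-hand inequality, we have

$$c \le \frac{st}{N} + \lambda\sqrt{\frac{st}{N}\,\frac{(N-s)(N-t)}{N^2}}$$

$$\frac{st}{N}\,\frac{(N-s)(N-t)}{N^2} \ge \frac{1}{\lambda^2}\left[\left(\frac{st}{N}\right)^2 + c^2 - 2c\,\frac{st}{N}\right]$$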


which is identical with the second expression above and therefore leads to the same result, namely $g(N) \le 0$ where $g(N)$ is the polynomial in Equation (9). Since $g(N) \le 0$ represents both inequalities in (8), it is satisfied by all values of $N$ in the confidence interval. The lower limit of the confidence interval is characterized by the fact that smaller values of $N$ are not in the interval and therefore do not satisfy $g(N) \le 0$, but larger values are and do. Therefore the lower limit is the next integer at or below a solution of $g(N) = 0$ such that $g$ is positive one unit below the limit and negative one unit above it; i.e., near the lower limit, $g(N)$ changes from positive to negative with increasing $N$. Similarly the upper limit of the confidence interval is the integer at or just above a larger root of $g(N) = 0$ where $g(N)$ changes from negative to positive with increasing $N$. In other words, the confidence limits are approximately two roots of $g(N) = 0$ between which $g(N)$ is negative. Inspection of $g(N)$ tells us that $g(0)$ is negative and that its derivative at 0 is positive. From Descartes' rule of signs, we know that $g(N) = 0$ has no negative and either one or three positive roots. From the genesis of the equation we know it has two positive roots, since the interval limits do exist and are distinct. Therefore it has three real roots, all positive. It is apparent that $g(N)$ has the configuration shown in Fig. 5 and that the confidence limits are the two upper roots.
Figure 5. General Configuration of $g(N)$
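The root-finding step described in part A can be sketched as follows. The code builds the coefficients of the cubic in Equation (9) and locates its real roots by bisection between the turning points of $g$. This numerical method is chosen here for illustration and is not taken from the report. The function names and the example values ($s = t = 100$, $c = 10$, $\lambda = 1.96$) are assumptions. The rounding to integers described in the text is left to the caller.

```python
from math import sqrt

def g_coeffs(s, t, c, lam):
    """Coefficients (a2, a1, a0) of g(N) = N^3 + a2 N^2 + a1 N + a0 in Eq. (9)."""
    st = s * t
    a2 = -(st / c) * (2 + lam ** 2 / c)
    a1 = (st / c ** 2) * (st + lam ** 2 * (s + t))
    a0 = -((lam * st / c) ** 2)
    return a2, a1, a0

def real_roots(a2, a1, a0, tol=1e-9):
    """Real roots of a monic cubic, by bisection on the intervals between turning points."""
    def g(x):
        return ((x + a2) * x + a1) * x + a0

    bound = 1 + max(abs(a2), abs(a1), abs(a0))  # Cauchy bound on the root magnitudes
    disc = a2 * a2 - 3 * a1                     # discriminant of g'(x) = 3x^2 + 2 a2 x + a1
    if disc > 0:
        x1, x2 = (-a2 - sqrt(disc)) / 3, (-a2 + sqrt(disc)) / 3
        edges = [-bound, x1, x2, bound]
    else:
        edges = [-bound, bound]

    roots = []
    for lo, hi in zip(edges, edges[1:]):
        if g(lo) * g(hi) > 0:
            continue  # no sign change, so no root in this interval
        while hi - lo > tol * max(1.0, abs(hi)):
            mid = (lo + hi) / 2
            if g(lo) * g(mid) <= 0:
                hi = mid
            else:
                lo = mid
        roots.append((lo + hi) / 2)
    return roots

def confidence_limits(s, t, c, lam):
    """The two largest real roots of g(N) = 0, taken as the interval limits."""
    roots = sorted(real_roots(*g_coeffs(s, t, c, lam)))
    return roots[-2], roots[-1]
```

For the example values the cubic has three positive roots, and the two largest bracket the point estimate $st/c = 1000$.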


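B. Confidence Limits for $\bar N_o$ and $\bar N_1$

We begin with a set of experimental values $\{c_i : i = 1, \dots, n\}$. Estimates $\bar N_o$ and $\bar N_1$ depend on the random variable $\bar c = \frac{1}{n}\sum_i c_i$, which is asymptotically normal with mean $st/N$ and variance $\bar\sigma^2 = \frac{st}{nN}\cdot\frac{(N-s)(N-t)}{N(N-1)}$. Or, if we use the normal approximation for each $c_i$, $\bar\sigma^2 = \frac{st}{nN}\cdot\frac{(N-s)(N-t)}{N^2}$.

Equation (8) becomes

$$P\left\{\frac{st}{N} - \lambda\bar\sigma \le \bar c \le \frac{st}{N} + \lambda\bar\sigma\right\} = 1 - \epsilon$$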


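Using the first form for $\bar\sigma^2$, the left-hand inequality leads to:

$$\frac{st}{N} - \lambda\sqrt{\frac{st}{nN}\,\frac{(N-s)(N-t)}{N(N-1)}} \le \bar c$$

$$\lambda^2\,\frac{st}{nN}\,\frac{(N-s)(N-t)}{N(N-1)} \ge \left(\frac{st}{N}\right)^2 + \bar c^2 - 2\bar c\,\frac{st}{N}$$

$$\lambda^2 st\,(N^2 - sN - tN + st) \ge (st)^2 n(N-1) + nN^2(N-1)\,\bar c^2 - 2st\,nN(N-1)\,\bar c$$

$$n\bar c^2 N^3 - N^2\left(n\bar c^2 + 2st\,n\bar c + \lambda^2 st\right) + N\left[(st)^2 n + 2st\,n\bar c + \lambda^2 st(s+t)\right] - (st)^2(n + \lambda^2) \le 0$$

$$g(N) \equiv N^3 - N^2\left[1 + \frac{st}{\bar c}\left(2 + \frac{\lambda^2}{n\bar c}\right)\right] + N\,\frac{st}{\bar c}\left[2 + \frac{st}{\bar c} + \frac{\lambda^2(s+t)}{n\bar c}\right] - \left(\frac{st}{\bar c}\right)^2\left(1 + \frac{\lambda^2}{n}\right) \le 0 \tag{10}$$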


The  right-hand  inequality  leads  to  the  same  form.  The  reasoning  described 
in  part  A of  this  appendix  therefore  establishes  the  confidence  limits  as  the 
two  largest  roots  of  g(N)  = 0 where  g(N)  is  as  defined  in  Equation  (10). 
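The algebra leading to Equation (10) can be checked by comparing the cubic with the inequality it came from. In this sketch (function names hypothetical) `inequality_gap` is the right side minus the left side of the squared inequality after clearing denominators, which should equal $n\bar c^2\,g(N)$ for every $N$. The arithmetic is generic, so exact rationals can be passed in to avoid rounding.

```python
def g10_coeffs(s, t, cbar, n, lam):
    """Coefficients (a2, a1, a0) of the monic cubic g(N) in Eq. (10)."""
    r = s * t / cbar
    a2 = -(1 + r * (2 + lam ** 2 / (n * cbar)))
    a1 = r * (2 + r + lam ** 2 * (s + t) / (n * cbar))
    a0 = -(r ** 2) * (1 + lam ** 2 / n)
    return a2, a1, a0

def g10(N, s, t, cbar, n, lam):
    """Evaluate g(N) of Eq. (10) by Horner's rule."""
    a2, a1, a0 = g10_coeffs(s, t, cbar, n, lam)
    return ((N + a2) * N + a1) * N + a0

def inequality_gap(N, s, t, cbar, n, lam):
    """n(N-1)[(st)^2 + cbar^2 N^2 - 2 cbar st N] - lam^2 st (N-s)(N-t).

    Non-positive exactly when the squared left-hand inequality holds.
    """
    st = s * t
    return (
        n * (N - 1) * (st ** 2 + cbar ** 2 * N ** 2 - 2 * cbar * st * N)
        - lam ** 2 * st * (N - s) * (N - t)
    )
```

Once the coefficients are available, the confidence limits follow from the two largest real roots of this cubic, as in part A.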


Miscellaneous  Proofs 


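A. Show that

$$\frac{1 + i/N}{(1 + i/s)(1 + i/t)} \le \frac{1 + i/N}{\left(1 + i/\sqrt{st}\right)^2}$$

$$(a-b)^2 = a^2 - 2ab + b^2 \ge 0$$

$$a^2 + b^2 \ge 2ab$$

Let $a^2 = c$, $b^2 = d$; then $ab = \sqrt{cd}$ and

$$c + d \ge 2\sqrt{cd}$$

Add $1 + cd$ to both sides:

$$1 + c + d + cd \ge 1 + 2\sqrt{cd} + cd$$

$$(1 + c)(1 + d) \ge \left(1 + \sqrt{cd}\right)^2$$

Let $c = i/s$, $d = i/t$.

Then $(1 + i/s)(1 + i/t) \ge \left(1 + i/\sqrt{st}\right)^2$

and $\dfrac{1 + i/N}{(1 + i/s)(1 + i/t)} \le \dfrac{1 + i/N}{\left(1 + i/\sqrt{st}\right)^2}$,

which was to be proved.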
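A quick numerical spot check of this inequality, with hypothetical function names, is sketched below. Equality is expected when $s = t$, since the arithmetic and geometric means of $i/s$ and $i/t$ then coincide.

```python
from math import sqrt

def k_ratio(i, N, s, t):
    """Left side: (1 + i/N) / ((1 + i/s)(1 + i/t))."""
    return (1 + i / N) / ((1 + i / s) * (1 + i / t))

def k_ratio_bound(i, N, s, t):
    """Right side: (1 + i/N) / (1 + i/sqrt(st))^2."""
    return (1 + i / N) / (1 + i / sqrt(s * t)) ** 2
```

The bound depends on $s$ and $t$ only through the product $st$, which is what makes it convenient in the main text.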

B. Show that $V = \mathrm{var} + (\mathrm{bias})^2$

Let $\hat N$ be any estimate of quantity $N$. Let $m$ and $b$ be the mean and bias respectively of $\hat N$: $m - N = b$ and $E(\hat N - m) = 0$.

$$V(\hat N) = E[(\hat N - N)^2] = E\{[(\hat N - m) + (m - N)]^2\}$$

$$= E[(\hat N - m)^2] + (m - N)^2 + 2(m - N)\,E(\hat N - m)$$

$$= \mathrm{var}(\hat N) + b^2$$
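The decomposition can be confirmed on any small discrete distribution. The sketch below uses a hypothetical helper and an arbitrary three-point distribution for $c$; with exact rationals the identity holds to the last digit.

```python
def mse_var_bias(pmf, estimator, N):
    """Return (V, var, b) for an estimator of N applied to c ~ pmf.

    pmf is {value of c: probability}; estimator maps c to the estimate N_hat.
    """
    mean = sum(estimator(c) * p for c, p in pmf.items())
    mse = sum((estimator(c) - N) ** 2 * p for c, p in pmf.items())
    var = sum((estimator(c) - mean) ** 2 * p for c, p in pmf.items())
    return mse, var, mean - N
```

The same helper applies to any of the estimates discussed in the appendices, provided the distribution gives zero probability to values of $c$ for which the estimate is undefined.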

